Abstract

The most fundamental problem in statistics is the inference of an unknown probability distri- 
bution from a finite number of samples. For a specific observed data set, answers to the following 
questions would be desirable: (1) Estimation: Which candidate distribution provides the best 
fit to the observed data?, (2) Goodness-of-fit: How concordant is this distribution with the ob- 
served data?, and (3) Uncertainty: How concordant are other candidate distributions with the 
observed data? A simple unified approach for univariate data, called "maximum fidelity", is
presented that addresses these traditionally distinct statistical notions. Maximum fidelity is a
frequentist approach (in the strict manner of Peirce and Wilson) that is fundamentally based on
model concordance with the observed data. The fidelity statistic is a general information mea- 
sure based on the coordinate-independent cumulative distribution and critical yet previously 
neglected symmetry considerations. A highly accurate gamma-function approximation for the 
distribution of the fidelity under the null hypothesis (valid for any number of data points) allows 
direct conversion of fidelity to absolute model concordance (p value), permitting immediate com- 
putation of the concordance landscape over the entire model parameter space. Maximization of 
the fidelity allows identification of the most concordant model distribution, generating a method 
for parameter estimation. Neighboring, less concordant distributions provide the "uncertainty" 
in this estimate. Detailed comparisons with other well-established statistical approaches reveal 
that maximum fidelity provides an optimal approach for parameter estimation (superior to max- 
imum likelihood) and a generally optimal approach for goodness-of-fit assessment of arbitrary 
models applied to any number of univariate data points distributed on the circle or the line.
Extensions of this approach to binary data, binned data, and multidimensional data are described,
along with improved versions of classical parametric and nonparametric statistical tests. Max- 
imum fidelity provides a philosophically consistent, robust, and seemingly optimal foundation 
for statistical inference. All findings in this manuscript are presented in an elementary way in 
order to be immediately accessible to researchers in the sciences, social sciences, and all other 
fields utilizing statistical analysis (medicine, engineering, finance, etc.). 
Researchers seeking simple methods to properly analyze their data are instead confronted by a
bewildering miscellany of statistical tools and approaches. This is primarily due to the existence 
of multiple distinct and irreconcilable philosophies of statistical inference. The research
community would greatly benefit from the establishment of a single, uncontroversial, philosophically
consistent, and sufficiently general approach for the optimal estimation of model parameters 
(and their "uncertainty") and the assessment of goodness-of-fit for arbitrary models applied 
to data. In this manuscript, I present an approach called "maximum fidelity" that appears to 
uniquely achieve these aims for the analysis of univariate data. For multivariate data, maximum 
fidelity allows for a proper understanding of why it is impossible to generate a single unique 
approach; however, after a generic choice of convention, maximum fidelity can also be extended 
to the analysis of multidimensional data. 

The fidelity is an information-based statistic derived from the cumulative distribution (the 
probability integral transform), the latter only uniquely defined for univariate data. Maximization
of the fidelity leads to an optimal means of parameter estimation for arbitrary models
applied to univariate data. A simple conversion of the fidelity to its associated p value (under 
the null hypothesis) allows for efficient, intuitive, and generally optimal estimation of model 
concordance. Similar concordance evaluation for neighboring models in the parameter space 
permits a notion of parameter "uncertainty". While maximum fidelity is only uniquely defined
for univariate data (due to its reliance on the cumulative distribution), its application to higher
dimensional data is also possible upon transformation of multidimensional data to univariate 
data through invocation of appropriate model and/or coordinate system symmetries (referred 
to below as "inverse Monte Carlo"). 

An illustrative example of the use of the fidelity for determining the level of concordance of a 
Gaussian model with a particular data set is shown in Fig. 1. Depicted are three observed data
points: x_1, x_2, and x_3. A Gaussian model is hypothesized to fit these data with its probability
distribution shown in blue and its corresponding cumulative distribution in black. The three 
data points map to model-dependent cumulative values, from which the fidelity can be deter- 
mined, which is the sum of the logarithm of the cumulative spacings between the points with a 
critical additional factor-of-two at the boundaries. Using a simple yet highly accurate
gamma-function approximation of the fidelity distribution under the null hypothesis, the fidelity can be
immediately converted to concordance (p value), allowing for an absolute assessment of model
goodness-of-fit to the data. While a Gaussian model was chosen for this example, maximum
fidelity is equally applicable to arbitrary probability distributions.

Figure 1: Example of the use of the fidelity for testing the concordance of a hypothesized Gaussian model (x_0 = 4,
σ = 1) with three observed data points. Here, the peak-normalized Gaussian probability distribution (blue) is also
expressed in terms of its cumulative distribution (black), which maps the points x_i to their corresponding cumulative
values c_i. The fidelity based on these particular c_i values is f = -0.159, which can be immediately converted to
its corresponding p value through a highly accurate gamma-function approximation, yielding p = 0.809 (with the
standard notion of p > 0.05 indicating a good fit).
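
The verbal description above (model-dependent cumulative values, the spacings between them, and a doubling at the two boundaries) can be sketched in a few lines of Python. This is only an illustration and not the manuscript's definition of the fidelity: the data values are hypothetical, the function names are invented here, and both the way the factor of two enters (multiplying the two boundary spacings) and the absence of any normalization are assumptions. The output is therefore not expected to reproduce the value quoted for Fig. 1.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution of a Gaussian model."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def cumulative_spacings(data, cdf):
    """Map the data through the model CDF and return the n + 1 spacings on [0, 1]."""
    c = sorted(cdf(x) for x in data)
    edges = [0.0] + c + [1.0]
    return [b - a for a, b in zip(edges, edges[1:])]

def log_spacing_score(data, cdf):
    """Sum of log spacings with the two boundary spacings doubled.

    ASSUMPTION: this placement of the factor of two and the lack of any
    normalization are guesses made for illustration only.
    """
    d = cumulative_spacings(data, cdf)
    d[0] *= 2.0
    d[-1] *= 2.0
    return sum(math.log(s) for s in d)

def model(x):
    """Hypothesized Gaussian with center 4 and width 1, as in the example."""
    return gaussian_cdf(x, 4.0, 1.0)

def shifted(x):
    """A deliberately poor model centered far from the data."""
    return gaussian_cdf(x, 8.0, 1.0)

data = [3.1, 4.2, 5.0]  # hypothetical observations, not the points of Fig. 1

print(log_spacing_score(data, model))
print(log_spacing_score(data, shifted))
```

Under these assumptions the model centered near the data receives the larger score, because the poorly centered model squeezes all of the cumulative values into a narrow range and produces very small spacings.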

Maximum fidelity represents a significant departure from traditional approaches to statistical 
inference. Many traditional approaches are ultimately premised on a logical fallacy concerning 
the notion of probability, as Charles Sanders Peirce (1839-1914) already perceptively noted over 
a century ago: 

"The relative probability of this or that arrangement of Nature is something which 
we should have a right to talk about if universes were as plenty as blackberries, if we 
could put a quantity of them in a bag, shake them well up, draw out a sample, and 
examine them to see what proportion of them had one arrangement and what proportion
another. But, even in that case, a higher universe would contain us, in regard
to whose arrangements the conception of probability could have no applicability."
(Peirce 2.684, i.e., volume 2, section 684 of his collected papers)

In traditional approaches, this probability fallacy is made from the outset of the statistical 
inference (e.g. inverse probability or Bayesian methods), upon the conclusion of the statistical
inference (e.g. various probabilistic interpretations of the probable error or of the confidence 
interval), or as justification of a particular statistic (frequentist coverage probability). A related 
fallacy (or unnecessary limitation) made in many traditional approaches is the restriction of 
consideration to a particular parametrized family of distributions (parameter fallacy); Bayesian 
and most frequentist approaches absolutely require this ad hoc restriction. Many statistical 
arguments are also only valid for a particular choice of coordinate system (coordinate fallacy), 
as they rely on coordinate-dependent moments like the mean and standard deviation of the data 
(e.g. frequentist confidence interval for a Gaussian based on the data mean) or the mean and 
standard deviation of statistics associated with the model parameters of a specific distribution 
family (e.g., the Fisher information and the related Cramér-Rao efficiency). For example, a
statistic that achieves the Cramér-Rao efficiency for estimation of σ for a Gaussian distribution
will not be efficient with respect to estimation of the variance v = σ². Equally troubling, for
every statistic that asymptotically satisfies the Cramér-Rao efficiency, another statistic can
always be found that is even more efficient (so-called superefficiency). And, again, what if the
underlying data are not actually drawn from a member of the distribution family assumed for
these efficiency calculations (parameter fallacy)? Many statistical approaches are based on proofs
that are only valid in the asymptotic limit of large numbers of data points (asymptotic fallacy,
e.g. the Cramér-Rao efficiency), which yields no insight into their reliability on the smaller data
sets for which statistical analysis is most relevant. Another significant fallacy is the use of the
likelihood function and likelihood ratio (likelihood fallacy). The likelihood itself is a
coordinate-dependent quantity. While the likelihood ratio is coordinate independent, its use mandates a
statistic based merely on comparison of the relative "quality of fit" of one model distribution
to another from the same family, which provides no information about the absolute concordance
of each individual model with the observed data (e.g. in the sense of Pearson's χ²-P test) and
prevents meaningful cross-comparison of models from different distribution families. Parameter
estimation via maximization of the likelihood can also fail for certain, otherwise well-defined
distribution families.

The specific aspects to which a proper approach to statistical inference should conform were 
outlined by Peirce in his extensive writings, particularly in his unorthodox yet well-motivated
definition of induction: 

"Induction is the experimental testing of a theory. The justification of it is that, 
although the conclusion at any stage of the investigation may be more or less erroneous,
yet the further application of the same method must correct the error. The
only thing that induction accomplishes is to determine the value of a quantity. It
sets out with a theory and it measures the degree of concordance of that theory with
fact." (Peirce 5.145)

Peirce's central notion of determining the "degree of concordance" of a theory will serve as a guiding
light throughout this manuscript. An inductive inference must avoid the above-mentioned
probability fallacy in order to be valid: 

"Every argument or inference professes to conform to a general method or type of 
reasoning, which method, it is held, has one kind of virtue or another in producing 
truth. In order to be valid the argument or inference must really pursue the method 
it professes to pursue, and furthermore, that method must have the kind of truth- 
producing virtue which it is supposed to have. For example, an induction may conform 
to the formula of induction; but it may be conceived, and often is conceived, that 
induction lends a probability to its conclusion. Now that is not the way in which 
induction leads to the truth. It lends no definite probability to its conclusion. It is 
nonsense to talk of the probability of a law, as if we could pick universes out of a 
grab-bag and find in what proportion of them the law held good. Therefore, such an 
induction is not valid; for it does not do what it professes to do, namely, to make 
its conclusion probable. But yet if it had only professed to do what induction does 
(namely, to commence a proceeding which must in the long run approximate to the 
truth), which is infinitely more to the purpose than what it professes, it would have 
been valid." (Peirce 2.780)

Peirce's arguments have been viewed as anticipating the work of the dominant frequentist schools
of the twentieth century represented by Ronald Fisher, Jerzy Neyman, and Egon Pearson.
While Peirce may have been sympathetic to some of what is contained in these approaches, his
views are actually more closely aligned with the pragmatic viewpoint of Egon Pearson's father
and Peirce's partial contemporary, Karl Pearson, and especially with the strict and uncontroversial
frequentist viewpoint espoused by Edwin Bidwell Wilson, a lifelong admirer of Peirce's
work.

The main distinctions between Karl Pearson's viewpoint and that of the later frequentist
schools arise, on the one hand, from Pearson's general emphasis on goodness-of-fit (his invention)
and, on the other hand, from the latter's embrace of the parameter and likelihood fallacies
described above. Pearson clearly took a more pragmatic view on the choice of model distribution
for fitting data (in the following, a "normal curve" is a Gaussian): "I have never found a normal
curve to fit anything if there are enough observations!" Pearson elaborated his views (in
the following, a "graduation curve" is simply a hypothesized model distribution):

"The reader will ask: 'But if they do not represent laws of Nature, what is the value of 
graduation curves?' He might as well ask what is the value of scientific investigation! 
A good graduation curve — that is, one with an acceptable probability — is the 
only form of 'natural law', which the scientific worker, be he astronomer, physicist or 
statistician can construct. Nothing prevents its being replaced by a better graduation; 
and ever better graduation is the history of science." [19, 20]

Pearson loosely uses the phrase "acceptable probability", by which he meant an acceptable
p value derived from his χ² test, which represented the first completely general measure of
model concordance comparable across arbitrary distributions. Pearson clearly would not have
been at all surprised if a "better graduation" came in the form of an arbitrarily different model
(e.g. from a distribution family differing from the one initially considered). To illustrate the
parameter fallacy from these different viewpoints, consider a collection of data points: x_1, ..., x_n.
In the Fisher-Neyman-E. Pearson approach, the data mean could be used to determine a fiducial
interval (Fisher) or confidence interval (Neyman-E. Pearson) on the central value μ (for fixed σ)
for a Gaussian family of models. No matter the observed distribution of the data, a confidence
interval or "belt" (e.g., encompassing the central 95% range) can always be defined based on the
value of the data mean (though this glosses over Neyman's paradox [21, 22], which is discussed in
§7 and resolved within the context of the fidelity). However, what if the data are not Gaussian,
as Pearson indicates above? The lack of a goodness-of-fit measure means that this question lies
outside the domain of the approaches developed by Fisher-Neyman-E. Pearson; as pointed out
above, these approaches in fact can only be defined upon complete restriction of consideration
to a specific distribution family. But such "parametrization of our ignorance" yields only a very
limited approach to statistical inference. Such circumscribed frequentist approaches therefore do
not appear to adequately "do what induction does (namely, to commence a proceeding which
must in the long run approximate to the truth)" (Peirce 2.780). By contrast, goodness-of-fit
measured by χ² provides a much surer way of determining the absolute concordance of the full
model distribution to the complete data. Minimization of χ² also provides an optimal method
for parameter estimation. However, χ² can only be applied to binned data and works well only
asymptotically (> 10 data points in each bin). If only there were an optimal approach similar to
χ² that was also valid for data sets of arbitrary size n (something sought by Pearson as
well). In this manuscript, I will argue that maximum fidelity represents just such an approach.
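
For readers unfamiliar with the binned statistic discussed above, the following minimal Python sketch computes Pearson's χ² as the sum over bins of (observed - expected)² / expected. The bin counts and model probabilities are hypothetical, the helper names are invented for this illustration, and applying the "more than 10 per bin" rule of thumb to the expected (rather than observed) counts is an assumption.

```python
def expected_counts(probabilities, n):
    """Expected bin counts for n data points under the model bin probabilities."""
    return [n * q for q in probabilities]

def pearson_chi2(observed, expected):
    """Pearson's chi-squared statistic for binned data."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def sparse_bins(expected, minimum=10):
    """Indices of bins whose expected count does not exceed the rule-of-thumb minimum."""
    return [i for i, e in enumerate(expected) if e <= minimum]

observed = [18, 22, 20, 40]                       # hypothetical binned data
expected = expected_counts([0.25] * 4, sum(observed))

chi2 = pearson_chi2(observed, expected)
print(chi2)                    # approximately 12.32
print(sparse_bins(expected))   # [] since every expected count is 25
```

The sketch stops at the statistic itself; converting it to a p value requires the χ² distribution with the appropriate number of degrees of freedom, which is where the asymptotic assumption criticized in the text enters.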

The main distinction between Wilson's viewpoint and that of the Fisher-Neyman-E. Pearson 
school can be found in their various notions of a confidence interval. Wilson was credited by 
Neyman as the originator of the concept of confidence intervals based on Wilson's seminal
paper "On Probable Inference" from 1927; however, Wilson demurred for reasons that will
become clear below. As this controversy is critical for understanding the viewpoint of the
current manuscript, it is necessary to delve a bit deeper. In his 1927 paper, Wilson addresses
inference of the true success rate, p, of a binary sequence of n events with an observed success
ratio of p_0. The conventional view was that:

"The probability that the true value of the rate p lies outside its limits p_0 - λσ_0 and
p_0 + λσ_0 is less than or equal to P_λ."

P_λ is a probability that decreases for increasing λ. Wilson criticized this view:

"Strictly speaking, the usual statement of probable inference as given above is ellip- 
tical. Really the chance that the true probability p lies outside a specified range is 
either 0 or 1; for p actually lies within that range or does not. It is the observed rate
p_0 which has a greater or less chance of lying within a certain interval of the true
rate p. If the observer has had the hard luck to have observed a relatively rare event
and to have based his inference thereon, he may be fairly wide of the mark."

Wilson developed his own statistic for describing such binary processes (by ordering the outcomes 
according to their individual probabilities for an assumed p, akin to a likelihood approach, albeit 
a coordinate-independent one due to the discrete nature of the data) and importantly discussed 
its interpretation as follows: 

"The rule then may be stated as: If the true value of the probability p lies outside 
the range . . . [Wilson states the range of his interval in terms of n, p_0, and λ] . . . the
chance of having such hard luck as to have made an observation so bad as p_0 is less
or equal to P_λ. And this form of statement is not elliptical. It is the proper form of
probable inference."
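
The range elided in the quotation is not reproduced in this manuscript. As a hedged illustration, the Python sketch below uses the form in which Wilson's interval is usually quoted in textbooks (the "Wilson score interval"), obtained by solving (p_0 - p)² = λ² p (1 - p) / n for p. The function name and the example numbers are hypothetical.

```python
import math

def wilson_interval(p0, n, lam):
    """Textbook Wilson score interval for a binary success rate.

    p0  : observed success ratio
    n   : number of events
    lam : half-width multiplier (lambda) in units of the binomial standard deviation
    """
    denom = 1.0 + lam ** 2 / n
    center = (p0 + lam ** 2 / (2.0 * n)) / denom
    half = lam * math.sqrt(p0 * (1.0 - p0) / n + lam ** 2 / (4.0 * n ** 2)) / denom
    return center - half, center + half

low, high = wilson_interval(0.5, 100, 2.0)
print(low, high)
```

Note that the standard deviation is evaluated at the hypothesized p rather than at the observed p_0, which is what makes the statement one about the chance of the observation given p, in the sense Wilson insists upon.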

In 1934, Clopper & E. Pearson published their own interpretation of the confidence interval
for such a binary process (Neyman cited their work as confirmation of his more general
arguments). Clopper-E. Pearson use the model cumulative distribution to define an exact,
central interval over which a confidence level or, more accurately, 2D "belt" (e.g. corresponding
to 95%) could be assigned based on the assumption of an independent Bernoulli process
for the events. Wilson criticized E. Pearson's probabilistic viewpoint in his 1942 paper "On
Confidence Intervals". Wilson first restated his view:



"I was trying to emphasize that we know nothing about the value of p, which must 
have whatever value it did have in the universe from which the sample was drawn, but 
that we could set limits based on probability calculations such that if p lay between 
them the chance of getting the particular observation or any less probable one would 
exceed some preassigned value P whereas if p lay outside them the chance of getting 
the observation or any less probable one would be less than P."

Wilson implied that Clopper and E. Pearson's probabilistic interpretation went too far: 

"Furthermore Clopper and Pearson (and Rietz) introduce the notion of confidence 
belts, which I did not have in mind, and state: 'We cannot therefore say that for any 
specified value of x (= np_0) the probability that the confidence interval will include
p is 0.95 or more. The probability must be associated with the whole belt, that is 
to say with the result of the continued application of a method of procedure to all 
values of x met with in our statistical experience' — a statement I should hesitate 
to make."

Note that Wilson always speaks of a limit on the chance of obtaining the observed p_0 based on
the assumption of a particular value of p, whereas Clopper-E. Pearson make a statement about
the probability of p lying within the confidence belt based on the observed value p_0. Based on
the implications of both his and their confidence intervals, Wilson concluded: 

"Thus although confidence intervals are based on probabilities, it is not certain that 
probabilities are based on them."
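
To make the Clopper-E. Pearson construction referred to above concrete, the following Python sketch computes the usual exact central interval from the binomial cumulative distribution by bisection. It follows the standard textbook definition (each tail probability set to half of one minus the level); the function names and example counts are hypothetical, and nothing here is specific to the fidelity-based treatment.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X binomial with n trials and success rate p."""
    return sum(math.comb(n, i) * p ** i * (1.0 - p) ** (n - i) for i in range(k + 1))

def _solve_decreasing(f, target, iterations=100):
    """Bisection for f(p) = target on [0, 1], where f is decreasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) > target:
            lo = mid   # f is still too large, so the root lies to the right
        else:
            hi = mid
    return 0.5 * (lo + hi)

def clopper_pearson(x, n, level=0.95):
    """Exact central interval for the success rate given x successes in n events."""
    alpha = 1.0 - level
    # Lower limit: P(X >= x | p) = alpha / 2, i.e. P(X <= x - 1 | p) = 1 - alpha / 2.
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1.0 - alpha / 2.0)
    # Upper limit: P(X <= x | p) = alpha / 2.
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2.0)
    return lower, upper

print(clopper_pearson(5, 10))
print(clopper_pearson(0, 10))
```

Each limit is the value of p at which the observed count, or anything more extreme in that direction, would have a tail probability of exactly half of one minus the level, which is a statement about the chance of the observation under an assumed p.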

Wilson's hesitation to conclude anything about the probability distribution of p was prescient 
given the modern nuanced distinction between the confidence level and the coverage probability.
In 1964, Wilson deflected Neyman's praise for developing the notion of the confidence
interval on the following basis: 

"Long ago I suggested not making any guess as to the value of an unknown [binary] 
probability [parameter] , but rather to consider what could be inferred from it or what 
value it would have to have to make some function of it take some assigned value."

And in the footnote to this statement: 

"Mr. Neyman, on page 222 of his Lectures and Conferences on Mathematical Statistics 
and Probability (USDA, 1952), suggested that I then had the idea of confidence 
intervals. I would make no such claim. I was merely trying to say that for logical 
reasons statisticians should treat an unknown probability as unknown and in using a 
standard deviation should allow for any Lexian ratio that rates of the sort they were 
considering might be expected to show."

Wilson is here implicitly criticizing Neyman's (and E. Pearson's) viewpoint in two ways. First, by 
unequivocally stating that "for logical reasons statisticians should treat an unknown probability 
as unknown" (here, "probability" clearly refers to the Bernoulli parameter p, not the probability 
distribution of p, but Wilson would have also agreed on the latter interpretation) and, second, 
by stressing the necessity of avoiding the parameter fallacy, in this case the restriction of con- 
sideration to a Bernoulli process. Binary data obtained in the real world might be drawn from 
a Bernoulli process, but they could also have been drawn from a different process, e.g. one that 
deviates from a Bernoulli process as judged by the "Lexian ratio". Wilson considered these
alternatives at the end of his 1927 paper: 

"It is well known that some phenomena show less and some show more variation than 
that due to chance as determined by the Bernoulli expansion (p + q)^n. The value L
of the Lexian ratio is precisely the ratio of the observed dispersion to the value of
(npq)^(1/2) or (pq/n)^(1/2) as the case may be. If we have general information which leads
us to believe that the variation of a particular phenomenon be supernormal (L > 1),
we naturally shall allow for some value of L in drawing the inference. Thus if the
Lexian ratio is presumed from previous analysis of similar phenomena to be in the
neighborhood of 5, we may use λ = 10 as properly as we should use λ = 2 if the
phenomenon were believed to be normal (Bernoullian)."
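
As a small illustration of the quantity Wilson describes, the Python sketch below computes a Lexian ratio for several groups of n binary events each, as the observed standard deviation of the success counts divided by (npq)^(1/2). The counts are hypothetical, the function name is invented here, and two choices are assumptions of this sketch: p is estimated from the pooled success rate, and the observed dispersion uses the sample (N - 1) form.

```python
import math

def lexian_ratio(counts, n):
    """Ratio of the observed dispersion of success counts to the Bernoulli value.

    counts : successes observed in each group
    n      : number of binary events per group (assumed equal across groups)
    """
    groups = len(counts)
    p = sum(counts) / (groups * n)            # pooled success rate (assumption)
    q = 1.0 - p
    mean = sum(counts) / groups
    variance = sum((c - mean) ** 2 for c in counts) / (groups - 1)
    return math.sqrt(variance) / math.sqrt(n * p * q)

counts = [40, 60, 50, 50]   # hypothetical successes in four groups of 100 events
L = lexian_ratio(counts, 100)
print(L)   # greater than 1, i.e. "supernormal" variation in Wilson's terminology
```

A value near 1 is what a Bernoulli process would be expected to produce, while values well above or below 1 indicate more or less variation than chance alone.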

Wilson here leaves open the possibility of an underlying distribution from an alternative family 
accounting for the data as well, which precludes making any definite fundamental statement 
about the probability of the overall number of successes to lie within certain bounds in the 
sense of Neyman-E. Pearson. The inference of a binary success rate from a sequence of n events 
is given a novel and more exact interpretation in §9 within the context of the fidelity and the
cumulative distribution. This fidelity-based approach is actually mathematically more related
to Clopper-E. Pearson's through its use of the cumulative distribution, but is philosophically
in line with Wilson's restrictive interpretation of the confidence interval. 

Peirce similarly understood that the quantities obtained through induction should not be 
interpreted as normal probabilities: 

"The theory here proposed does not assign any probability to the inductive or hypo- 
thetic conclusion, in the sense of undertaking to say how frequently that conclusion 
would be found true. It does not propose to look through all the possible universes, 
and say in what proportion of them a certain uniformity occurs; such a proceed- 
ing, were it possible, would be quite idle. The theory here presented only says how 
frequently, in this universe, the special form of induction or hypothesis would lead 
us right. The probability given by this theory is in every way different — in mean- 
ing, numerical value, and form — from that of those who would apply to ampliative 
inference the doctrine of inverse chances." (Peirce 2.748)

Peirce defines this "probability" quantity more exactly in the following passage: 

"The third order of induction, which may be called Statistical Induction, differs 
entirely from the other two in that it assigns a definite value to a quantity. It draws 
a sample of a class, finds a numerical expression for a predesignate character of 
that sample and extends this evaluation, under proper qualification, to the entire 
class, by the aid of the doctrine of chances. The doctrine of chances is, in itself, 
purely deductive. It draws necessary conclusions only. The third order of induction 
takes advantage of the information thus deduced to render induction exact." (Peirce 
7.120)

Peirce based his views largely upon the similar consideration of inference on binary data sets (e.g. 
Peirce 2.687), which certainly influenced Wilson's formulation of "a proper form of probable
inference".

The approach presented here, called maximum fidelity, avoids the probability fallacy and 
the parameter fallacy — the latter the mathematical crutch of both Bayesian and most con- 
ventional frequentist approaches to inference — as well as all of the other fallacies or ad hoc 
limitations mentioned above. Maximum fidelity represents a literal interpretation of Peirce's 
and Wilson's guidelines, which have heretofore not been fully appreciated. Maximum fidelity 
"sets out with a theory" (a particular hypothesized probability distribution) and "measures 
the degree of concordance of that theory with fact" (Peirce 5.145), obtaining the fidelity and
associated concordance as universal "predesignate characters" of the data set of arbitrary size 
n. The fidelity is a seemingly optimal information-theoretic statistic defined by the model-based 
cumulative mapping of the observed data points and important symmetry considerations. Its 
basis in the cumulative mapping importantly makes it independent of the choice of coordinate 
system for the data (avoiding the coordinate fallacy). Maximization of the fidelity yields the 
particular candidate model that is most concordant with the data, viz., the model that best 
represents the complete "information" contained in the data. Maximum fidelity unifies several
typically distinct notions, including Peirce's foundational work on scientific and statistical inference,
the cumulative distribution, information theory, general mathematical considerations
(symmetry, boundaries, dimensionality), optimal parameter estimation, and the central notion
of model concordance. Maximum fidelity also provides a clear setting for the application of Ockham's
razor, which together with model concordance are the dual, competing considerations
upon which all scientific inference is based.

Figure 2: Comparison of traditional statistical approaches with maximum fidelity.

The principal differences between maximum fidelity and traditional approaches are displayed
in Fig. 2. In traditional approaches, a hierarchy of "difficulty" (or "ambiguity" or
"ill-definedness") is present, with parameter estimation the easiest, parameter uncertainty more
difficult, and goodness-of-fit only possible for a very restricted class of problems. At each level,
completely different frameworks, assumptions, and artificial limitations are typically required.
Maximum fidelity subverts this hierarchy through its fundamental, unifying basis in model concordance
(goodness-of-fit), with parameter estimation and a heavily qualified form of parameter
"uncertainty" following naturally.

The simplest route to understanding maximum fidelity is to see how the fidelity statistic 
is derived, how the method is applied, and how well it performs in comparison with other 
approaches. The outline of this manuscript is therefore as follows. In §2, a general derivation of
the fidelity and the related fidelity statistic for univariate data on the circle and on the line is
presented. In §3, the superiority of maximum fidelity over all other methods (including the
"gold standard" method of maximum likelihood) for parameter estimation on the circle and
the line is demonstrated. In §4, I show how the fidelity can be converted directly to an absolute
concordance value (p value, derived from the null hypothesis) through use of a highly accurate
gamma-function approximation. In this section, I also argue for the general superiority of this
fidelity-based concordance value to other classical goodness-of-fit measures. In §5, joint analysis
of independent data sets with maximum fidelity is explained. In §6, fidelity-based generalizations
of Student's t test and related tests are presented. In §7, Neyman's paradox is resolved within the
context of the fidelity. In §8, the straightforward extension of maximum fidelity to binned data
is given. In §9, a solution to the classical problem of binary distributions within the context of
the fidelity is given. In §10, the application of maximum fidelity to higher dimensional data sets
is demonstrated based on "inverse Monte Carlo" reasoning. In §11, an extension of maximum
fidelity to the nonparametric (or, more accurately, model-independent) determination of whether
two observed data sets were drawn from the same unknown distribution is proposed and shown
to be generally superior to other classical tests. In §12, I conclude with a discussion of these
results, in particular as they pertain to general scientific inference.

2 Derivation of the Fidelity 

The information measure I refer to as the fidelity was first introduced by Shannon and
Wiener and later further examined by Kullback and Leibler, after whom this measure is
generally named (e.g., the Kullback-Leibler divergence or Kullback-Leibler relative entropy).
The fidelity, in fact, can be considered the most fundamental information measure, from which
other measures can be derived. (It is important to note that my use of the term "fidelity" should
be distinguished from Shannon's use of the same term for a different, non-unique quantity.)

My choice of the word fidelity to describe this information measure best reflects its role in 
comparing how well a specific distribution represents or translates the information contained in 
a reference distribution. The fidelity can best be intuited from the example of text translation. 
The fidelity of translation from English to Greek will have one value determined by the degree of 
degeneracy of the translated words (the multiple Greek words that can be substituted for a given 
word in English, as well as the occasional need to use the same Greek word to substitute for 
different English words). The fidelity will have a different value when translating from Greek to 
English, which exhibits well the importance of the directionality of the fidelity. In the simplest 
approach, the fidelity could be based on the substitution degeneracy of all the words in the 
translation dictionaries (for a particular translation direction) , with equal "weight" given to each 
word. A more informative measure would come from assigning different weights for the words 
dependent on their frequency of usage in each language. The fidelity could also hypothetically 
be applied to a particular work (e.g., translating Shakespeare's Hamlet to Greek, or translating 
Homer's Iliad to English) to attempt to quantify the translation fidelity or, in the opposite 
sense, what might have been "lost in translation". The example of translation should not be 
interpreted too literally, though, as different words often have only slightly different nuances 
of meaning that cannot be expressed quantitatively, which severely complicates the task of 
assigning a single meaningful value for the fidelity. Translation nevertheless serves as a useful 
intuitive guide to the directional quantity represented by the fidelity. 

Formulas for the entropy or related information measures are often presented in a mysterious 
or purely mathematical manner. To avoid confusion, it is important to maintain a direct intuitive 
link to what these measures actually represent (which is accomplished by sticking to definitions 
based on discrete distributions). Most information measures can be derived from considering 
the distribution of balls into slots, as shown in Fig. 3. In information theory, these balls can
be considered as infinitesimal units of probability. In statistical mechanics, these balls can be
considered as infinitesimal probability units over the phase space, as infinitesimal units of energy
partitioned across all degrees of freedom of the system, or indeed as literal particles falling into
different categories.

Figure 3: The probability that a ball lands in a particular bin $b$ out of $B$ total bins is denoted by $Q_b$, with the
number of balls in that bin given by $N_b$. The bins could be considered equivalent in width with a uniform probability
$Q_b = 1/B$ (left), or of different sizes and therefore probabilities $Q_b$ (right). For display purposes, I have chosen to
give the balls a "width" similar to the width of the bins; the balls should in actuality be considered infinitesimal.

Consider $N$ identical balls distributed across $B$ bins with $N_b$ balls in each bin $b$. We would
like to calculate the frequency with which a particular distribution of the $N$ balls, $N_1, N_2, \ldots, N_B$,
will occur given the assumed known probabilities $Q_b > 0$ ($\sum_{b=1}^{B} Q_b = 1$) for a single dropped
ball to land in each bin. For bins of uniform width, $Q_b = 1/B$ (Fig. 3, left); however, we will
assume that the bins can have arbitrary widths (Fig. 3, right). The frequency of occurrence of
a particular distribution $N_1, \ldots, N_B$ can be determined from the corresponding term in the
multinomial expansion of the $N$th power of the sum of probabilities for a single ball:

$$\frac{N!}{N_1! \cdots N_B!} \, Q_1^{N_1} \cdots Q_B^{N_B}. \qquad (3)$$

In the limit of large $N_b$, we can take the logarithm and use the De Moivre-Stirling approximation:

$$\log N! = N \log N - N + \frac{1}{2}\log(2\pi N) + O(1/N), \qquad (4)$$



to obtain:

In the last line, we neglect the $O(1/N_b)$ terms (limit of large $N_b$). Defining $R_b = N_b/N$, we are
interested in the ratio, $\zeta$, of the probability of a given distribution to the probability of the distribution obtained
for $R_b = Q_b$ for all $b$, which amounts to subtraction of the logarithm of the latter in the
below expression:

Assuming $R_b$ and $Q_b$ are strictly greater than 0, then, in the $N \to \infty$ limit, we can neglect the
last term, giving simply:

where I have removed the variable $R_B$ using the normalization condition, leaving $B - 1$ independent
variables. The collection $R_1, \ldots, R_B$ that maximizes $\log \zeta^{R_1,\ldots,R_B;N}_{Q_1,\ldots,Q_B}$ (and therefore also
$\zeta^{R_1,\ldots,R_B;N}_{Q_1,\ldots,Q_B}$) can be found by setting the derivatives with respect to each $R_b$ (from $b = 1$ to
$B - 1$) to 0:

which gives for each $b$ simply:

Plugging these into the normalization condition gives:

$$R_B = Q_B, \qquad (12)$$

which, according to Eq. 9, implies that $R_b = Q_b$ for all $b = 1, \ldots, B$.
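
The steps above can be condensed into a short worked sketch. It is a reconstruction under stated assumptions (the multinomial term for the frequency of occurrence, the De Moivre-Stirling approximation, and the symbol $\zeta$ for the ratio), and its line-by-line form is not guaranteed to match the displayed equations. Applying the approximation to the multinomial term and subtracting its value at $R_b = Q_b$ gives

$$\log \zeta \approx N \sum_{b=1}^{B} R_b \log\frac{Q_b}{R_b} + \frac{1}{2} \sum_{b=1}^{B} \log\frac{Q_b}{R_b},$$

where the second sum does not grow with $N$ and is therefore negligible for large $N$. Eliminating $R_B = 1 - \sum_{b=1}^{B-1} R_b$ and differentiating the leading term with respect to $R_b$ gives

$$\frac{\partial \log \zeta}{\partial R_b} = N \left( \log\frac{Q_b}{R_b} - \log\frac{Q_B}{R_B} \right) = 0 \quad \Longrightarrow \quad \frac{Q_b}{R_b} = \frac{Q_B}{R_B}.$$

Writing $Q_b = R_b \, Q_B / R_B$ and summing over all $b$ gives $1 = Q_B / R_B$, so that $R_B = Q_B$ and hence $R_b = Q_b$ for every bin.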

That this unique extremum is in fact a global maximum can be proven by examining the
Hessian matrix of second derivatives as follows. The diagonal terms of the $(B - 1) \times (B - 1)$
Hessian of $\log \zeta$ are:

$$\frac{\partial^2}{\partial R_b^2} \log \zeta^{R_1,\ldots,R_B;N}_{Q_1,\ldots,Q_B} = N \frac{\partial}{\partial R_b} \left( \log\frac{Q_b}{R_b} - \log\frac{Q_B}{R_B} \right) = -N \left( \frac{1}{R_b} + \frac{1}{R_B} \right). \qquad (13)$$

All off-diagonal terms ($i \neq b$) are equal to:

$$\frac{\partial^2}{\partial R_i \, \partial R_b} \log \zeta^{R_1,\ldots,R_B;N}_{Q_1,\ldots,Q_B} = -N \frac{1}{R_B}. \qquad (14)$$

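The Hessian can therefore be written as the sum of a diagonal matrix and a uniform matrix in
the following way:

$$H = -N \begin{pmatrix} \frac{1}{R_1} & & \\ & \ddots & \\ & & \frac{1}{R_{B-1}} \end{pmatrix} - \frac{N}{R_B} \begin{pmatrix} 1 & \cdots & 1 \\ \vdots & \ddots & \vdots \\ 1 & \cdots & 1 \end{pmatrix}. \qquad (15)$$

That $\log \zeta$ is strictly concave down (not a saddle point) at its unique extremum can be proven by
verifying that all eigenvalues of $H$ are negative. Taking the natural basis for the first, diagonal
matrix leads to the trivial eigenvalues of:

$$\lambda_1 = -\frac{N}{R_1}, \quad \ldots, \quad \lambda_{B-1} = -\frac{N}{R_{B-1}}, \qquad (16)$$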

which are all negative (since we have assumed all Rb are strictly greater than 0), implying that 
the first matrix is negative definite. A natural set of eigenvectors for the second, uniform matrix 
can be assumed: 



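$$
\begin{aligned}
v_1 &= (1, \ldots, 1) \\
v_2 &= (1, -1, 0, \ldots, 0) \\
v_3 &= (1, 0, -1, 0, \ldots, 0) \\
&\;\;\vdots \\
v_{B-1} &= (1, 0, \ldots, 0, -1), \qquad (17)
\end{aligned}
$$

with corresponding eigenvalues:

$$
\begin{aligned}
\lambda_1 &= -(B - 1) \frac{N}{R_B} \\
\lambda_2 &= 0 \\
&\;\;\vdots \\
\lambda_{B-1} &= 0, \qquad (18)
\end{aligned}
$$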

which implies that the second matrix is negative semi-definite. As the sum of a negative definite 
matrix and a negative semi-definite matrix, the full Hessian matrix H is negative definite. This 
proves that $\log \zeta$ is strictly concave down at its unique extremum, therefore establishing this
point as a global maximum.
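
The negative definiteness argued above can also be checked numerically. The following is a minimal sketch rather than part of the derivation: the helper names `hessian` and `is_negative_definite` are hypothetical, the relative frequencies are arbitrary random values, and a Cholesky factorization of $-H$ is used here as a stand-in for the eigenvalue argument (the factorization succeeds only if $-H$ is positive definite).

```python
import math
import random


def hessian(R, N):
    """Hessian of log(zeta) in the B-1 free variables R_1..R_{B-1}.

    R holds all B relative frequencies; the last entry, R_B, is the one
    eliminated through the normalization condition.
    Diagonal terms: -N (1/R_b + 1/R_B); off-diagonal terms: -N / R_B.
    """
    r_last = R[-1]
    m = len(R) - 1
    return [[-N * ((1.0 / R[i] if i == j else 0.0) + 1.0 / r_last)
             for j in range(m)] for i in range(m)]


def is_negative_definite(H):
    """Return True if H is negative definite (Cholesky of -H succeeds)."""
    m = len(H)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = -H[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return False  # a non-positive pivot: -H is not positive definite
                L[i][i] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return True


random.seed(0)
weights = [random.random() + 0.05 for _ in range(6)]   # arbitrary positive weights
R = [w / sum(weights) for w in weights]                # normalized, all R_b > 0
print(is_negative_definite(hessian(R, N=1000)))
```

Because every $R_b$ is strictly positive, the check should succeed for any such choice of `R`, in line with the argument in the text.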

At this maximum, $\log \zeta^{R_b = Q_b;N}_{Q_1,\ldots,Q_B} = 0$ or $\zeta^{R_b = Q_b;N}_{Q_1,\ldots,Q_B} = 1$. At all other points,
$\zeta$ is exponentially suppressed as $\zeta = e^{Nf}$ with

$$f = \sum_{b=1}^{B} R_b \log\frac{Q_b}{R_b} \qquad (19)$$

in the discrete case. For the continuous case (the $B \to \infty$ limit), $R_b \to r(b)\,db$ and $Q_b \to q(b)\,db$,
implying

$$f = \int db \, r(b) \log\frac{q(b)}{r(b)}. \qquad (20)$$

In the above, we assume that $r(b)$ and $q(b)$ are sufficiently smooth (which represents an important
assumption) to justify conversion of the sum to an integral.
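
The exponential suppression described above can be illustrated numerically. The sketch below is an illustration under assumptions, not the author's code: it takes the frequency of occurrence to be the multinomial term, uses `math.lgamma` for exact log-factorials, and compares $(1/N) \log \zeta$ with the discrete fidelity $f = \sum_b R_b \log(Q_b/R_b)$. The function names and the values of `Q` and `R` are hypothetical choices made for the example.

```python
import math


def log_multinomial(counts, Q):
    """Exact log-probability of occupation numbers `counts` for bin probabilities Q."""
    total = sum(counts)
    out = math.lgamma(total + 1)  # log N!
    for n_b, q_b in zip(counts, Q):
        out += n_b * math.log(q_b) - math.lgamma(n_b + 1)
    return out


def fidelity(R, Q):
    """Discrete fidelity f = sum_b R_b * log(Q_b / R_b); requires all R_b, Q_b > 0."""
    return sum(r * math.log(q / r) for r, q in zip(R, Q))


def log_zeta_per_ball(R, Q, N):
    """(1/N) log of the ratio of the multinomial term at R to the term at R = Q."""
    counts = [round(r * N) for r in R]
    reference = [round(q * N) for q in Q]
    return (log_multinomial(counts, Q) - log_multinomial(reference, Q)) / N


Q = [0.5, 0.3, 0.2]   # assumed bin probabilities
R = [0.4, 0.4, 0.2]   # assumed observed relative frequencies
for N in (10, 100, 1000, 10000):
    # The two printed values should approach each other as N grows.
    print(N, log_zeta_per_ball(R, Q, N), fidelity(R, Q))
```

The values of `N` are chosen so that `R * N` and `Q * N` are whole numbers of balls; for other choices the rounding inside `log_zeta_per_ball` would introduce a small additional discrepancy.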

The fidelity provides a natural ordering of distributions specified by the $R_b$ (or continuous
distributions $r(b)$) in terms of those that occur most often ($f$ near 0) to those that occur least
often (increasingly negative $f$) given the basis distribution specified by the $Q_b$ (or the continuous
distribution $q(b)$). The fidelity importantly groups together differently shaped distributions that
have the same "frequency of occurrence". As one can observe from the above, it is best to
define information measures over the discrete distributions $R_b$ and $Q_b$, as this preserves the
meaning of the "frequency of occurrence" (the logarithmic terms in the fidelity ultimately derive
from approximation of the factorials over discrete distributions). This meaning is unfortunately
obscured when information measures are defined from the start using continuous functions such
as $r(b)$ and $q(b)$, which are perhaps only later discretized.

The fidelity, as derived above, is a completely general information measure. The associated 
fidelity statistic for univariate data on the line can be derived from the following "density" 
argument. Consider a sorted collection of $n$ points $x_i$, for example, drawn from a Gaussian
distribution (Fig. 4A). For maximum likelihood, the values of the model probability distribution,
$f(x_i)$, corresponding to each observed point $x_i$ are all that are needed to construct the
likelihood (which is just the product of all of these values). Maximum fidelity, on the other hand, is
based on the cumulative distribution. The observed points are mapped by the model cumulative
distribution to the unit interval (Fig. 4A). In order to calculate the fidelity statistic associated
with these cumulative values, we assume that each point provides a local estimate of the
distribution "density" on the cumulative interval. The only unbiased way to do this (in a way that
preserves the local information) is to split each point in half and distribute the resulting weight
over the neighboring left and right intervals (Fig. 4B). This allows us to determine how much
weight should be distributed to each model-defined cumulative interval, based upon which we
can calculate the discrete version of the fidelity defined in Eq. 19.

Figure 4: (A) Mapping of observed data points to the unit interval by the cumulative distribution (black) of a Gaussian
probability distribution with $\mu = 4$ and $\sigma = 1$ (blue). (B) Determination of $R_b$ from the splitting of the weight of
the mapped data points to the adjacent cumulative intervals, with interval sizes, $Q_b$, determined by the model-based
cumulative distribution mapping (the x-axis here is identical to the y-axis in A).

The most important thing to note in Fig. 4B is that the first and last interval contain only
a half point, whereas all interior intervals contain a whole point (the sum of the contributions
from both adjacent points). For a given hypothesized model distribution, a mapping of the
individual points via the model cumulative distribution generates a series of cumulative values
$c_1, \ldots, c_n$ ranging over the unit interval (Fig. 4A). The fidelity statistic is then determined from
consideration of the relative weight ($R_b$) of each bin in each of the intervals ($Q_b$) created by the
cumulative mapping of the observed data points (Fig. 4B):

$$f_n = \frac{1}{2n} \log[2n(c_1 - 0)] + \frac{1}{2n} \log[2n(1 - c_n)] + \sum_{i=1}^{n-1} \frac{1}{n} \log[n(c_{i+1} - c_i)]. \qquad (22)$$

The fidelity statistic can be interpreted in the following way. The $R_b$ values, denoting the relative
weights in each bin (either a half point or a full point), should be considered as a particular
outcome (e.g., giving the relative numbers of balls in each bin, as in Fig. 3), the frequency
of which we wish to ascertain based on the cumulative bin widths determined by the model-derived
$Q_b$ values. Maximization of the fidelity therefore identifies the model for which the
observed outcome has the highest "frequency of occurrence" (under our "density" assumption).
The optimal model would place the $n$ data points evenly over the cumulative interval at
$c_k = (k - 1/2)/n$ (for $k = 1, \ldots, n$), yielding the maximum value for the fidelity of $f = 0$.
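
The construction above translates directly into a few lines of code. The following is a minimal sketch, assuming the half-point weighting described in the text: each data point contributes half its weight to each neighboring cumulative interval, so the two boundary intervals carry weight $1/(2n)$ and the interior intervals carry weight $1/n$. The function names and the data values are hypothetical, and `statistics.NormalDist` stands in for whatever model cumulative distribution is under test.

```python
import math
from statistics import NormalDist


def fidelity_line(c):
    """Fidelity statistic for cumulative values c_1..c_n on the unit interval."""
    n = len(c)
    edges = [0.0] + sorted(c) + [1.0]
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]     # the Q_b
    weights = [0.5 / n] + [1.0 / n] * (n - 1) + [0.5 / n]      # the R_b
    if min(widths) <= 0.0:
        return -math.inf  # tied points or numerically saturated cumulative values
    return sum(r * math.log(q / r) for r, q in zip(weights, widths))


def fidelity_of_model(data, cdf):
    """Map the data through a model cumulative distribution, then score the mapping."""
    return fidelity_line([cdf(x) for x in data])


data = [2.6, 3.4, 3.9, 4.5, 5.8]  # hypothetical observations
print(fidelity_of_model(data, NormalDist(4.0, 1.0).cdf))  # a plausible model
print(fidelity_of_model(data, NormalDist(0.0, 1.0).cdf))  # a poor model, far more negative

n = 5
print(fidelity_line([(k - 0.5) / n for k in range(1, n + 1)]))  # evenly placed points, f close to 0
```

Only the cumulative values enter `fidelity_line`, which is why two models that map the data to the same cumulative values cannot be distinguished by this statistic.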

Note that the only information used by the fidelity derives from the model-based mapping 
of the discrete events to their cumulative values. All other information about the function is 
ignored. Any other functions having very different shape (large perturbations, highly oscillatory) 
yet preserving this mapping are indistinguishable in this approach. Candidate distributions that 
map the data points to the exact same cumulative values form a mapping equivalence class. 
Candidate distributions that generate the same fidelity (and therefore "frequency of occurrence")
form an even larger fidelity equivalence class. 

That the boundary intervals contain only a half point in Fig. 4B constitutes the only difference
between maximum fidelity and the related approach of maximum spacings, which attributes
equal weight to all of the intervals. This difference is critical: For the spacings statistic, the
intervals are treated as fundamental, whereas for the fidelity, the data points are fundamental.
The maximum spacings statistic for $n$ points on the line (superscript $l$ for line) is:

The spacings statistic has its roots in the work of Moran, who used it to test the concordance
of the hypothesis of a Poisson process with an observed data set (time series), but it has been
explored by many other researchers since then. Its connection with information theory
has long been recognized. Later investigations have examined the maximization of the
spacings statistic as a method for both parameter estimation and general goodness-of-fit
assessment. As derived above, for $n$ data points on the line, the fidelity (with a slight
reordering of the boundary terms) is:

$$f_n = \frac{1}{2n} \log[2n(1 - c_n)] + \frac{1}{2n} \log[2n(c_1 - 0)] + \sum_{i=1}^{n-1} \frac{1}{n} \log[n(c_{i+1} - c_i)]. \qquad (24)$$

For $n$ data points on the circle, the fidelity, which is identical to the circular version of the
spacings statistic $s_n^c$, is:

$$f_n^c = s_n^c = \frac{1}{n} \log[n((1 - c_n) + (c_1 - 0))] + \sum_{i=1}^{n-1} \frac{1}{n} \log[n(c_{i+1} - c_i)]. \qquad (25)$$

Figure 5: Optimal cumulative mappings based on maximum fidelity and maximum spacings on the circle and on the
line for $n = 3$ data points. The $\phi$ parameter indicates the arbitrariness (phase degeneracy) of the location of the
coordinate origin on the circle. The line interval has been bent into a circle (with boundaries 0 and 1 indicated at the
top) to emphasize the symmetry of the solutions.

Figure 6: Cumulative intervals on the circle and on the line for $n = 6$ points.



The difference between the optimal solutions implied by these different statistics is graphically
displayed at the top of Fig. 5 for $n = 3$ points. In this figure, the cumulative intervals for the
line are bent into a circle to emphasize the non-symmetric weighting of the boundary intervals
for maximum spacings. The symmetric set of solutions on the circle (with phase freedom $\phi$) and
the single solution shown for the fidelity on the line (which is symmetric with respect to the
boundary) can be viewed as an instance of "symmetry breaking". A physical theory that breaks
symmetry is one that selects a particular solution from a set of symmetric solutions. One can
clearly see in Fig. 5 that the solution of the fidelity on the line belongs to the solution set on
the circle, whereas the spacings solution does not correspond to any of the phase-degenerate
solutions on the circle. In Fig. 6, $n = 6$ data points are plotted with arbitrary values. The
interval at the boundary, $\alpha_6$, created by the first and the last point is afforded the same weight
for maximum fidelity as the other intervals both on the circle and on the line. On the circle, $\alpha_6$
is directly considered (as there is no real boundary on the circle); however, on the line, separate
consideration is required of the two subintervals, $\beta_1$ and $\beta_2$, which only together contribute a
weight equal to the non-boundary intervals.
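
The relationship between the optimal solutions for $n = 3$ can be reproduced with a short numerical sketch. Two assumptions should be flagged: the function names are hypothetical, and the spacings statistic on the line is taken in an assumed Moran-type form, the mean of $\log[(n+1)(c_{i+1} - c_i)]$ over all $n + 1$ intervals with $c_0 = 0$ and $c_{n+1} = 1$, which may differ from the author's normalization by an additive constant (this would not change the location of its maximum).

```python
import math


def fidelity_line(c):
    """Fidelity on the line: half weight on each of the two boundary intervals."""
    n = len(c)
    c = sorted(c)
    f = math.log(2 * n * c[0]) / (2 * n) + math.log(2 * n * (1 - c[-1])) / (2 * n)
    return f + sum(math.log(n * (b - a)) for a, b in zip(c, c[1:])) / n


def fidelity_circle(c):
    """Fidelity on the circle: the interval across the origin counts as one interval."""
    n = len(c)
    c = sorted(c)
    gaps = [b - a for a, b in zip(c, c[1:])]
    gaps.append((1.0 - c[-1]) + c[0])  # wrap-around interval
    return sum(math.log(n * g) for g in gaps) / n


def spacings_line(c):
    """Assumed Moran-type spacings statistic: equal weight on all n + 1 intervals."""
    n = len(c)
    edges = [0.0] + sorted(c) + [1.0]
    return sum(math.log((n + 1) * (b - a)) for a, b in zip(edges, edges[1:])) / (n + 1)


n = 3
fidelity_optimum = [(k - 0.5) / n for k in range(1, n + 1)]   # 1/6, 1/2, 5/6
spacings_optimum = [k / (n + 1) for k in range(1, n + 1)]     # 1/4, 1/2, 3/4
phase = 0.2                                                   # arbitrary rotation
rotated = [(x + phase) % 1.0 for x in fidelity_optimum]

# The line optimum of the fidelity is also optimal on the circle, for any phase.
print(fidelity_line(fidelity_optimum), fidelity_circle(rotated))
# The spacings optimum is not optimal on the circle (second value is below zero).
print(spacings_line(spacings_optimum), fidelity_circle(spacings_optimum))
```

All three functions assume distinct cumulative values strictly inside the unit interval; tied or boundary values would make a logarithm undefined.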
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Table 1: Probability distributions on the circle. 



3 Parameter Estimation 

In the following sections, parameter estimation of probability distributions on the circle and on
the line using maximum fidelity is compared with other well-established approaches, including
the "gold standard" method of maximum likelihood. I will focus in particular on parameter
estimation on low-$n$ data sets, for which statistical analysis is truly necessary and discrimination
among the various estimation methods can be easily visualized. Invocation of the Cramer-Rao
efficiency to "prove" that one estimator is better than another is avoided for the reasons listed
in §1 and is anyway invalid for the low-$n$ data sets (typically $n = 5$) considered below. Other
values for $n$ were tested (not shown) with similar results, indicating their generality. Estimate bias
with respect to the mean of the parameter estimate distribution is often discussed as a measure
of optimality; however, this is completely dependent on the choice of parameter coordinate
representation (e.g., $\sigma$ vs. $v = \sigma^2$). One might be tempted to define a good parameter estimate
as one with no median bias, which at least would be coordinate invariant. However, narrowness
of the parameter estimate distribution around the true value is also an important consideration
and there is no guarantee that an estimator that is median unbiased will also have the narrowest
distribution. It is important to avoid taking too seriously (or quantitatively) the comparisons
made below for parameter estimation within a fixed distribution family, as these comparisons are
but a form of parameter fallacy (see §1). The dual aspects of optimal estimators that nevertheless
will be focused on below are low (but possibly non-zero) median bias and a narrow distribution
around the true value. We will see that these qualitative considerations are already sufficient to
determine the best overall estimation approach.

3.1 Parameter Estimation on the Circle 

We first consider parameter estimation for circular distributions. Unlike distributions on the
line, there are only a handful of circular distributions that are actually interesting to consider
as models for real-world data. Here, we focus on the three flexible distribution families listed
in Table 1. In this table, $\beta$ generally denotes a location parameter and $\alpha$ generally represents a shape
parameter for the distribution (though $\beta$ is not a location parameter for the Wrapped Laplace
distribution). In addition to the fidelity and the likelihood, we test the parameter estimation
precision of several other circular statistics named after their originators: Ajne, Gini, Kuiper,
Rao, Rayleigh, and Watson. All of these tests, aside from the Gini test, are well
described by Mardia & Jupp.

We first test the precision of estimation of the location parameter $\beta$ for the von Mises
distribution (Fig. 7). Here, data sets containing $n = 5$ points were repeatedly generated (1000
times) from a von Mises distribution with $\alpha = 1$ and $\beta_0 = \pi$. Each data set was fit with the
indicated statistics to generate the observed cumulative distributions of the best fit $\beta$ values.
Due to symmetry considerations, the median value of the fit distribution should asymptotically
approach the true value of $\beta_0 = \pi$ for all of the statistics (they are all median unbiased),
with only the steepness of each fit distribution revealing the precision of the estimator. As can
clearly be seen, both maximum fidelity and maximum likelihood provide equivalently narrow
(and overlapping) distributions about the true value. All other statistics perform significantly
worse.

Table 2: Statistics on the circle ($c_{n+1} = 1 + c_1$ is defined for convenience). For maximum fidelity and maximum
likelihood, the statistic is maximized to find the optimal parameter estimate. For all the other statistics, the minimum
is sought.

Figure 7: Location parameter fitting on the circle. For each distribution, $n = 5$ data points
were repeatedly drawn (1000 times) from a von Mises distribution with $\alpha = 1$ and $\beta = \pi$ (see Table 1), with the fit
parameter $\beta$ determined by extremizing the various statistics listed in Table 2. To perform this fitting, the cumulative
distribution and inverse cumulative functions must be calculated for the true model and the test parameter models.
These functions were estimated by calculation on a grid of 101 points (across the full range of $\beta$) followed by Stineman
interpolation. Cumulative distributions for these guesses using the different statistics are displayed, with the vertical
dashed line indicating the true value and the horizontal dashed line indicating the median value of the cumulative
distribution of guesses. A good estimator should be steep around the true value with low median bias.

In Fig. 8, the more revealing consideration of the estimation of the shape (or width)
parameters $\alpha$ for each distribution is displayed (for these fits, $\beta$ is assumed known and the true
$\alpha$ is $\alpha_0$, with the exact values taken for each distribution listed in Table 1). For the von Mises
distribution, all statistics are equally steep around the true value of $\alpha_0$. All statistics, aside
from the Kuiper statistic, also have acceptably low median bias. For the Theta distribution,
maximum fidelity and maximum likelihood provide equivalently high precision estimates of the
relevant parameter $\alpha$, with the median estimate lying near the true value of $\alpha_0$, flanked by a
steep distribution above and below this value. All other estimate distributions clearly have broader wings.
For the Wrapped Laplace distribution, maximum fidelity and the Watson statistic provide the
best distributions, with maximum likelihood interestingly displaying the worst estimate
distribution (this could arise from the "kink" at the origin for the Wrapped Laplace, see the red
curve in Fig. 16 for an example). Maximum fidelity, therefore, provides the best shape
parameter estimate for all of the tested distributions.

Figure 8: Shape parameter fitting on the circle. For each distribution, $n = 5$ data points were repeatedly drawn
(1000 times) with the fit parameter $\alpha$ determined by extremizing the various statistics listed in Table 2. To perform
this fitting, the cumulative distribution and inverse cumulative functions must be calculated. These functions were
estimated by calculation on a grid of 101 points (evenly sampled over $\log_{10}(\alpha/\alpha_0)$ from -1.5 to 1.5) followed by
Stineman interpolation. Cumulative distributions for these estimates using the different statistics are displayed, with
the vertical dashed line indicating the true value and the horizontal dashed line indicating the median value of the
cumulative distribution of estimates. A good estimator should be steep around the true value with low median bias.



3.2 Parameter Estimation on the Line 

We now examine the parameter estimation precision for multiple distributions on the line
(Table 3) of maximum fidelity and other statistics (Table 4), including the classical statistics
named after their (co-)originators: Anderson-Darling, Cramer-von Mises, Gini, and
Kolmogorov-Smirnov. In addition to these, other statistics based on Equal Intervals (the
sum of the deviation of each cumulative interval from $1/(n + 1)$) and Order Statistics (sum of
the logarithm of the order statistics) are also considered.

Table 3: Probability distributions on the line.

Table 4: Statistics on the line ($c_0 = 0$ and $c_{n+1} = 1$ are defined for convenience). For the Cramer-von Mises, Equal
Intervals, Gini, and Kolmogorov-Smirnov statistics, the minimum is considered optimal. For all the other statistics,
the maximum is considered optimal.

In Fig. 9, a "location" test is performed: The estimation precision for the mean value of a
Gaussian (with width $\sigma = 1$ assumed known) is compared across the statistics listed in Table 4.
All statistics, excluding the Equal Intervals statistic and the Gini statistic, have equal precision
for this "location" test.

Figure 9: Fitting of a location parameter for a Gaussian on the line. For each model, $n = 5$ data points were drawn
repeatedly from a Gaussian with $\mu = 0$ and $\sigma = 1$. Fitting of the Gaussian mean $\mu$ (assuming a fixed $\sigma = 1$)
was carried out for each realization (1000 total) using the listed statistics. All listed statistics lead to overlapping
distributions except for the Equal Intervals and Gini statistics.

In Fig. 10, the parameter that more generally dictates the shape or width of the distribution
(rather than its location) is fit using the statistics listed in Table 4 for each indicated distribution
family ($\beta$ and $\alpha_0$ values listed in Table 3). For these statistics, the estimation of shape parameters
appears to be a more stringent test than the estimation of "location" shown in Fig. 9. Maximum
fidelity provides the best estimate for all of the tested distributions, with maximum likelihood
giving good but typically more median biased estimates. For the Student t distribution, the
estimate provided by maximum fidelity is far superior to that provided by maximum likelihood
or any of the other statistics. In all cases, therefore, the fidelity is the clearly superior choice for
parameter estimation of distribution width or shape.
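
A small Monte Carlo experiment of the kind described above can be sketched as follows. This is an illustration under several assumptions rather than a reproduction of the reported procedure: only a Gaussian with known mean is treated, the true values $\mu = 0$ and $\sigma_0 = 1$ are assumed, the maximization is done by a plain grid search over $\log_{10}(\sigma/\sigma_0)$ from -1.5 to 1.5 (not by Stineman interpolation), fewer realizations are used, and all function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, median


def fidelity_line(c):
    """Fidelity statistic on the line (half weight on the two boundary intervals)."""
    n = len(c)
    edges = [0.0] + sorted(c) + [1.0]
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    weights = [0.5 / n] + [1.0 / n] * (n - 1) + [0.5 / n]
    if min(widths) <= 0.0:
        return -math.inf  # numerically saturated cumulative values
    return sum(r * math.log(q / r) for r, q in zip(weights, widths))


def fit_sigma_max_fidelity(data, mu, grid):
    """Grid search for the Gaussian width that maximizes the fidelity (mu held fixed)."""
    best_sigma, best_f = None, -math.inf
    for sigma in grid:
        model = NormalDist(mu, sigma)
        f = fidelity_line([model.cdf(x) for x in data])
        if f > best_f:
            best_sigma, best_f = sigma, f
    return best_sigma


def fit_sigma_max_likelihood(data, mu):
    """Closed-form maximum likelihood width of a Gaussian with known mean."""
    return math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))


# 301 trial widths, evenly spaced in log10(sigma) from -1.5 to 1.5 (sigma_0 = 1 assumed).
GRID = [10 ** (-1.5 + 3.0 * k / 300) for k in range(301)]


def simulate(trials=200, n=5, seed=1):
    """Median width estimates from both methods over repeated n-point data sets."""
    rng = random.Random(seed)
    mf, ml = [], []
    for _ in range(trials):
        data = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mf.append(fit_sigma_max_fidelity(data, 0.0, GRID))
        ml.append(fit_sigma_max_likelihood(data, 0.0))
    return median(mf), median(ml)


print(simulate())  # (median maximum-fidelity estimate, median maximum-likelihood estimate)
```

Sorting the two lists of estimates instead of taking their medians would give empirical cumulative distributions of the kind compared in the figures.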



3.3 Simultaneous Fitting of Gaussian Mean and Width 

In the previous sections, only a single model parameter was fit. However, the most important
example of parameter fitting in statistics is the simultaneous fitting of the central value $\mu$ and
width $\sigma$ of a hypothesized Gaussian. In Fig. 11, a side-by-side comparison of maximum fidelity
with maximum likelihood and maximum spacings is displayed for data sets containing $n = 5$
points (1000 realizations) drawn from a Gaussian ($\mu = 0$, $\sigma = 1$). This trio of statistics is
reminiscent of the choice of porridges presented to Goldilocks. Maximum likelihood and maximum
spacings prefer Gaussian widths that are respectively narrower ("hotter") or broader ("colder")
than the estimates obtained using the standard deviation, whereas the estimate distribution
obtained from maximum fidelity perfectly overlaps that obtained from the standard deviation
("just right"). The standard deviation is generally considered to provide an "optimal" estimate
(as the mean bias of the variance estimate $v = \sigma_{SD}^2$ is zero). The bias of maximum likelihood
towards distributions with higher peaks is well known (see also the below discussion of Fig. 12),
as is the need for a mean bias correction of the likelihood estimates to match the true $\sigma^2$:
$\sigma^2 = \langle \sigma_{SD}^2 \rangle = (n/(n - 1)) \langle \sigma_{ML}^2 \rangle$. The $n/(n - 1)$ correction of the maximum likelihood estimates
not only removes the mean bias of $\sigma_{ML}^2$, but also generates an estimate distribution that
perfectly overlaps the standard deviation and maximum fidelity estimate distributions (not shown).
That maximum spacings assigns too much weight to the boundary intervals is clear from Fig. 5,
which accounts for its preference for flatter distributions. Testing of different $n$ (not shown)
leads to similar results, indicating their generality. It is important to note that for any given data
set, the separate $\sigma$ estimates obtained from maximum fidelity and from the standard deviation
will generally differ; it is only the full parameter estimate distributions that are identical. While
maximum fidelity coincides with the standard deviation in its power of estimation, as we will
see in the next section, it also automatically provides a coordinate-independent measure of the
degree of concordance of the hypothesized model with the data. Such valuable information is
unobtainable from the standard deviation estimate or from the maximum likelihood estimate.
Furthermore, parameter estimation by maximum fidelity is possible for any distribution family,
unlike the standard deviation, and is also always reliable, unlike maximum likelihood (see §1).
All of these considerations point to the fidelity as the most fundamental statistic for parameter
estimation.



3.4 Comparison of Maximum Fidelity and Maximum Likelihood 

Figure 10: Fitting of shape parameters on the line. $n = 5$ data points were drawn repeatedly from the indicated
distributions (values assumed for $\beta$ and $\alpha_0$ are listed in Table 3). Fitting of the shape parameter $\alpha$, with $\beta$ fixed at
its true value, was carried out for each realization using each indicated statistic (1000 total).

Figure 11: Estimate distributions for $\sigma$ obtained upon simultaneous fitting of Gaussian $\mu$ and $\sigma$ for $n = 5$ data points
(for 1000 Monte Carlo realizations). Estimate distributions obtained using maximum fidelity, maximum likelihood,
maximum spacings, and the standard deviation, i.e., $\sigma_{SD} = \sqrt{\sum_i (x_i - \langle x \rangle)^2/(n - 1)}$.

A comparison of the fidelity and the likelihood is given in Fig. 12. Model fitting in statistics is
often restricted to a family of distributions (e.g., Gaussians) in order to enable the calculation
of probable errors, confidence intervals, credible intervals, etc. Under these restrictions, the
likelihood has traditionally played an important role. However, if we had complete freedom in
the specification of the model, maximum likelihood would always lead to $\delta$-function spikes at
each observed data point. By contrast, maximum fidelity is only concerned with the cumulative
mapping of the data points. In Fig. 12, three data points are fit with increasingly narrower
distributions, for which the likelihood takes on increasingly higher values. It is clear that the
model with highest likelihood would indeed correspond to a sum of $\delta$-functions at each observed
data point. The fidelity, on the other hand, is identical for all of the models (including the
$\delta$-function model), as they all map the data points to the same cumulative values.
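
This contrast can be illustrated with a short sketch. The three models compared in the text are not specified in detail, so the example below assumes a stand-in family: an equal-weight mixture of three Gaussians centered on the data points $x = 3$, 5, and 9, with a common width that is made progressively smaller. The function names and widths are hypothetical. For such a mixture, each data point sits at the center of its own component, so the cumulative values stay at approximately 1/6, 1/2, and 5/6 while the peaks grow.

```python
import math
from statistics import NormalDist

DATA = [3.0, 5.0, 9.0]  # the three observed values used in the text


def fidelity_line(c):
    """Fidelity statistic on the line (half weight on the two boundary intervals)."""
    n = len(c)
    edges = [0.0] + sorted(c) + [1.0]
    widths = [hi - lo for lo, hi in zip(edges, edges[1:])]
    weights = [0.5 / n] + [1.0 / n] * (n - 1) + [0.5 / n]
    return sum(r * math.log(q / r) for r, q in zip(weights, widths))


def mixture_pdf(x, width):
    """Density of the assumed model: equal-weight Gaussians centered on the data."""
    return sum(NormalDist(m, width).pdf(x) for m in DATA) / len(DATA)


def mixture_cdf(x, width):
    """Cumulative distribution of the assumed mixture model."""
    return sum(NormalDist(m, width).cdf(x) for m in DATA) / len(DATA)


def log_likelihood(width):
    return sum(math.log(mixture_pdf(x, width)) for x in DATA)


def fidelity(width):
    return fidelity_line([mixture_cdf(x, width) for x in DATA])


for width in (0.3, 0.1, 0.03):
    # The log-likelihood keeps rising as the peaks sharpen; the fidelity stays near 0.
    print(width, log_likelihood(width), fidelity(width))
```

The widths are kept small relative to the spacing of the data so that the components barely overlap; for broader components the cumulative values would drift away from 1/6, 1/2, and 5/6 and the fidelity would change accordingly.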

It is clear that maximum spacings and maximum fidelity are asymptotically equivalent. But
what is the asymptotic connection of these statistics with the likelihood? Ranneby has already
shown that the likelihood and the spacings statistic represent different ways of approximating
the continuous version of the fidelity (or Kullback-Leibler divergence). However, as argued
above in §2, it is better to start with the more intuitive discrete version of the fidelity and derive
the likelihood as an approximation. Starting from the fidelity on the line, we can express $f_n$ in terms of the
probability distribution $p(x)$ as follows:

$$f_n = \frac{1}{2n} \log\left[2n \int_{x_n}^{\infty} p(x)\,dx\right] + \frac{1}{2n} \log\left[2n \int_{-\infty}^{x_1} p(x)\,dx\right] + \sum_{i=1}^{n-1} \frac{1}{n} \log\left[n \int_{x_i}^{x_{i+1}} p(x)\,dx\right]. \qquad (26)$$

For large $n$, we can approximate these integrals using forward finite differences (final value of
$p(x)$ on each interval times the $x$ interval width):

$$\lim_{n \to \infty} f_n \approx \frac{1}{2n} \log[2n(\infty - x_n)p(\infty)] + \frac{1}{2n} \log[2n(x_1 - (-\infty))p(x_1)] + \sum_{i=1}^{n-1} \frac{1}{n} \log[n(x_{i+1} - x_i)p(x_{i+1})], \qquad (27)$$




or backward finite differences:

$$\lim_{n \to \infty} f_n \approx \frac{1}{2n} \log[2n(\infty - x_n)p(x_n)] + \frac{1}{2n} \log[2n(x_1 - (-\infty))p(-\infty)] + \sum_{i=1}^{n-1} \frac{1}{n} \log[n(x_{i+1} - x_i)p(x_i)]. \qquad (28)$$

Taking the average of these two approximations yields a better, more symmetric approximation
of:

(29)

The fidelity on the line is therefore asymptotically equivalent to the sum of a model-independent
normalization term (interior intervals), a model-dependent boundary term, and $1/n$ times
the logarithm of the likelihood (aside from the 1/2 exponents on the first and last terms). One
can likewise show (with the result being equivalent for forward or backward finite differences)
that the circular fidelity is asymptotically:

$$\lim_{n \to \infty} f_n^c \approx \frac{1}{n} \log[n^n (x_2 - x_1) \cdots (1 + x_1 - x_n)] + \frac{1}{n} \log[p(x_1)p(x_2) \cdots p(x_{n-1})p(x_n)], \qquad (30)$$

which is equivalent to the sum of a model-independent normalization term and $1/n$ times the
logarithm of the likelihood function.

Figure 12: Comparison of maximum fidelity and maximum likelihood. Three different normalized probability
distributions are shown with increasingly higher peaks at the observed data values of $x = 3$, 5, and 9. Maximum likelihood
gives the highest likelihood to the red model, whereas maximum fidelity gives the same fidelity for all three models,
as all models (by design) place the data points at the respective cumulative values of $c(x) = 0.1667$, 0.5, and 0.8333.



4 Conversion of Fidelity to Concordance 



In this section, I show how the fidelity can be efficiently converted to an absolute measure of 
model concordance (jp value) under assumption of the null hypothesis (i.e. the hypothesis that 
the trial distribution used to calculate the fidelity is the true distribution from which the data 
were drawn). 

Arguably the most important and widely used tool in statistics is Karl Pearson's χ² test, which was
the first quantitative and universal approach for measuring model goodness-of-fit. Unfortunately,
this test requires binned data with a sufficiently high level (> 10) of counts in each bin. It is
therefore only reliably defined asymptotically. Since then, several other goodness-of-fit measures
have been proposed based on various statistics of the cumulative distribution of the data points.
While these alternatives are valid for any n, none of them can compete with the generality and
power of Pearson's χ² (in its range of validity). In this section, I will derive a highly accurate
gamma-based approximation for the p value of the fidelity, which is valid for any number of data
points n. As the fidelity is not limited to asymptotic situations (high n), it provides an even
more fundamental basis for goodness-of-fit than χ². As maximum fidelity is asymptotically
consistent with maximum likelihood (§3.4) and it is well known that maximum likelihood is
asymptotically equivalent to minimum χ², maximum fidelity is also asymptotically equivalent to
minimum χ². As all three estimation approaches are consistent estimators in the asymptotic limit
(see Ranneby (1984) for a proof of the consistency of the spacings statistic that is also valid
for the fidelity), minimum χ² can be thought of as simply the asymptotic version of maximum
fidelity. This glosses over the "binned" nature of the data required for χ² analysis. Often, data
are artificially binned (resulting in degraded resolution) simply in order to apply the power of
χ² analysis (with the number of data points in each bin ensured to be greater than 10). Use of the
fidelity, however, obviates the need for such artificial binning. For data that are collected in
a binned fashion, maximum fidelity can also be applied in a consistent fashion (see §8), which
further bolsters the above statement of minimum χ² as the asymptotic version of maximum fidelity.

Calculation of the expected distribution of a cumulative-based statistic under the null
hypothesis simply requires repeatedly drawing n data points from a uniform distribution on the
unit interval and calculating the resulting statistic. The statistics of random points distributed
on the line, as well as the intervals that they create, were first explored by Whitworth [61, 62].
In our particular case, we are interested in the statistics of the product of intervals on the line
created by the n randomly drawn data points. Knowledge of this null distribution of the fidelity
will allow us to convert the fidelity to a concordance measure (p value). The corresponding
null distribution for the fidelity or spacings statistics cannot be expressed explicitly. However,
through use of the Laplace transform, Darling was able to calculate the exact solution for the
characteristic function corresponding to the distribution of the spacings statistic (sum of the log
of the intervals for n points on the line):

= = ir~. I e z ( e- rz+l ^ osr dr) dz 
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By the same method, the corresponding result for the distribution of the fidelity for n points on 
the circle is: 

/ v Mr(i + -))" 
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and the distribution of the fidelity for n points on the line is: 
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(33) 



The first two cumulants (corresponding to the mean and the variance) of the null distribution
for each statistic follow from differentiation of the log of the above characteristic functions.
The result for the spacings distribution on the line (already obtained by Darling) is:
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The corresponding cumulants for the fidelity distribution on the circle are 
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and for the fidelity distribution on the line are: 
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In the above expressions, 7 = 0.5772156649 ... is the Euler constant and 
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(34) 
(35) 

(36) 
(37) 

(38) 
(39) 

(40) 



are the polygamma functions. Also, in the above, all expressions in parentheses approach zero
asymptotically (n → ∞).
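
The first two cumulants can also be estimated by brute force, which gives a useful cross-check on any closed-form expression. The sketch below is a minimal illustration rather than the paper's procedure: it assumes the fidelity on the line is (1/n) times the sum of the log-intervals with the two boundary intervals carrying half weight (as in Eq. 27), and it draws n uniform points repeatedly. For n = 1 this assumed definition reduces to f = (1/2) log[4c(1 - c)], whose mean and variance work out to log 2 - 1 and 1 - π²/12, which the Monte Carlo estimates should reproduce.

```python
import math
import random
import statistics


def line_fidelity(cumulative_values):
    """Assumed fidelity on the line: boundary intervals carry half weight."""
    c = sorted(cumulative_values)
    n = len(c)
    gaps = [c[0]] + [b - a for a, b in zip(c, c[1:])] + [1.0 - c[-1]]
    total = 0.5 * math.log(2 * n * gaps[0]) + 0.5 * math.log(2 * n * gaps[-1])
    total += sum(math.log(n * g) for g in gaps[1:-1])
    return total / n


def null_fidelities(n, realizations, seed=0):
    """Fidelity of n uniform points on the unit interval, repeated many times."""
    rng = random.Random(seed)
    return [line_fidelity([rng.random() for _ in range(n)])
            for _ in range(realizations)]


def first_two_cumulants(sample):
    """Sample mean and (population) variance of the null fidelities."""
    return statistics.fmean(sample), statistics.pvariance(sample)


for n in (1, 5, 20):
    mean, variance = first_two_cumulants(null_fidelities(n, 20_000))
    print(f"n={n:2d}  mean={mean:+.4f}  variance={variance:.4f}")
```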

A simple gamma-function approximation to the null distribution of the fidelity based on the 
first two cumulants can be defined as follows: 
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where the geometry g is either c (fidelity on the circle) or I (fidelity on the line). The corre- 
sponding p value is then given by: 
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where Q(s, x) is the regularized gamma function. Setting the first two cumulants of the gamma 
distribution (which completely specify it) to the exact solutions given above yields: 
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Table 5: Complete set of explicit solutions and power-law or gamma-function approximations for the p value of 
fidelity for each n on the circle and the line. 



With this definition, remarkably good agreement with Monte Carlo calculations is obtained for all
n ≥ 4 on the circle and n ≥ 3 on the line (see Fig. 13). For n = 1 on the circle, the
distribution is trivial. For n = 2 on the circle and n = 1 on the line, the distributions can be
solved for exactly:

$$P_n(f_n) = \frac{e^{2 f_n}}{\sqrt{1 - e^{2 f_n}}}, \tag{45}$$

$$p_n(f_n) = 1 - \sqrt{1 - e^{2 f_n}}. \tag{46}$$
For n = 3 on the circle and n = 2 on the line, the distributions are approximated very well by
a simple exponential (Fig. 13):



P^{f a n ) * -^e-^S (47) 
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fin 
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The complete system of approximations for the p value — summarized in Table [5] with com- 
puted coefficients listed in Tables [6] and [7] for a range of n — serves as a solid and extremely 
efficient foundation for assessment of model concordance using maximum fidelity. For the
spacings statistic, Cheng & Stephens presented a similar, though slightly less accurate approach
obtained by fitting the null distribution to a χ² distribution with n degrees of freedom plus a
displacement term correction.
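
To make the conversion concrete, the following Python sketch matches a gamma distribution to the first two cumulants of the null fidelity and compares the resulting p value with a direct Monte Carlo estimate. Several things here are assumptions rather than the paper's implementation: the fidelity on the line is taken to have half-weight boundary intervals, the cumulants `kappa1` and `kappa2` are estimated by simulation instead of being taken from the closed-form expressions, and the p value is written as Q(κ₁²/κ₂, κ₁ f/κ₂), which is what moment matching gives for the non-negative variable -f. The helper `gamma_q` is a standard series and continued-fraction evaluation of the regularized upper incomplete gamma function, written out so that only the standard library is needed.

```python
import bisect
import math
import random
import statistics


def gamma_q(a, x, eps=1e-12, max_iter=1000):
    """Regularized upper incomplete gamma function Q(a, x) for a > 0."""
    if x <= 0.0:
        return 1.0
    prefactor = math.exp(-x + a * math.log(x) - math.lgamma(a))
    if x < a + 1.0:
        # Power series for P(a, x); then Q = 1 - P.
        denom, term = a, 1.0 / a
        total = term
        for _ in range(max_iter):
            denom += 1.0
            term *= x / denom
            total += term
            if abs(term) < abs(total) * eps:
                break
        return 1.0 - total * prefactor
    # Continued fraction for Q(a, x), evaluated with the modified Lentz method.
    tiny = 1e-300
    b = x + 1.0 - a
    c, d = 1.0 / tiny, 1.0 / b
    h = d
    for i in range(1, max_iter + 1):
        an = -i * (i - a)
        b += 2.0
        d = an * d + b
        d = tiny if abs(d) < tiny else d
        c = b + an / c
        c = tiny if abs(c) < tiny else c
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h * prefactor


def line_fidelity(cumulative_values):
    """Assumed fidelity on the line: boundary intervals carry half weight."""
    c = sorted(cumulative_values)
    n = len(c)
    gaps = [c[0]] + [b - a for a, b in zip(c, c[1:])] + [1.0 - c[-1]]
    total = 0.5 * math.log(2 * n * gaps[0]) + 0.5 * math.log(2 * n * gaps[-1])
    total += sum(math.log(n * g) for g in gaps[1:-1])
    return total / n


def gamma_p_value(f, kappa1, kappa2):
    """Moment-matched gamma approximation to P(F <= f); assumes f <= 0."""
    return gamma_q(kappa1 ** 2 / kappa2, kappa1 * f / kappa2)


def compare(n=10, realizations=20_000, seed=0):
    """Rows of (fidelity, Monte Carlo p value, gamma-approximation p value)."""
    rng = random.Random(seed)
    null = sorted(line_fidelity([rng.random() for _ in range(n)])
                  for _ in range(realizations))
    kappa1, kappa2 = statistics.fmean(null), statistics.pvariance(null)
    rows = []
    for quantile in (0.01, 0.05, 0.25, 0.5, 0.75, 0.95):
        f = null[int(quantile * realizations)]
        empirical = bisect.bisect_right(null, f) / realizations
        rows.append((f, empirical, gamma_p_value(f, kappa1, kappa2)))
    return rows


for f, empirical, approx in compare():
    print(f"f={f:+.3f}  Monte Carlo p={empirical:.3f}  gamma p={approx:.3f}")
```

In practice the simulation step would be replaced by tabulated coefficients, so that each conversion costs only one evaluation of Q.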



4.1 Concordance Landscape 

Armed with the powerful approximations derived in the previous section, we can directly convert
the fidelity to concordance (p value) for a specific model fit to an observed data set (as was
previewed in Fig. 1). We can also immediately plot entire concordance landscapes over the
fitting parameters for a particular observed data set. This is demonstrated for a Gaussian fit
(mean μ and standard deviation σ) to three different observed data sets (all with n = 20) in
Fig. 14. Note that the maximum p value in each case depends on the data set. Concordance
determined through the fidelity is, of course, completely general for any model distribution, not
just Gaussian distributions. In Fig. 15, similar concordance landscapes are shown for the fitting
of Extreme Value distributions to three different data sets (all with n = 40).
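
A concordance landscape of this kind can be sketched in a few lines of Python. The example below is illustrative only and makes several assumptions: the fidelity on the line uses half-weight boundary intervals, the p value is obtained from a Monte Carlo null sample rather than from the gamma approximation, and the data are a hypothetical, idealized set (the 20 central quantiles of a standard normal), so the best model reaches p = 1, which real data generally will not.

```python
import bisect
import math
import random
import statistics


def line_fidelity(cumulative_values):
    """Assumed fidelity on the line: boundary intervals carry half weight."""
    c = sorted(cumulative_values)
    n = len(c)
    gaps = [c[0]] + [b - a for a, b in zip(c, c[1:])] + [1.0 - c[-1]]
    if min(gaps) <= 0.0:
        return -math.inf  # the model assigns zero probability to an interval
    total = 0.5 * math.log(2 * n * gaps[0]) + 0.5 * math.log(2 * n * gaps[-1])
    total += sum(math.log(n * g) for g in gaps[1:-1])
    return total / n


def null_sample(n, realizations=5_000, seed=0):
    """Sorted null fidelities for n uniform points on the unit interval."""
    rng = random.Random(seed)
    return sorted(line_fidelity([rng.random() for _ in range(n)])
                  for _ in range(realizations))


def concordance(f, null):
    """Monte Carlo p value: fraction of null fidelities not exceeding f."""
    return bisect.bisect_right(null, f) / len(null)


def landscape(data, mus, sigmas, null):
    """Map each (mu, sigma) to (fidelity, p value) for a Gaussian model."""
    grid = {}
    for mu in mus:
        for sigma in sigmas:
            cdf = statistics.NormalDist(mu, sigma).cdf
            f = line_fidelity([cdf(x) for x in data])
            grid[(mu, sigma)] = (f, concordance(f, null))
    return grid


# Hypothetical, idealized data: 20 central quantiles of a standard normal.
data = [statistics.NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
mus = [i / 10 for i in range(-10, 11)]
sigmas = [i / 10 for i in range(5, 21)]
grid = landscape(data, mus, sigmas, null_sample(len(data)))

best = max(grid, key=lambda model: grid[model][0])  # maximum fidelity
accepted = [model for model, (_, p) in grid.items() if p > 0.05]
print("most concordant model:", best, "p =", grid[best][1])
print("models with p > 0.05:", len(accepted), "of", len(grid))
```

The list `accepted` is the discrete analogue of the p > 0.05 zone of the landscape: every entry is an individual model with its own absolute p value.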

In the concordance landscapes above, every point in the space represents an individual
model for the data and gives an absolute p-value measure of goodness-of-fit, echoing Peirce's
dictum that induction "sets out with a theory and it measures the degree of concordance of that
theory with fact" (Peirce 5.145). If we decided we were interested in maximizing the fidelity
over the model parameters and only considering the "p value obtained at the maximum", then
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Figure 13: Peak-normalized probability distributions of the fidelity on the line and circle under the null hypothesis 
(1 million realizations at each n). Overlaid in black are the approximations for the fidelity summarized in Table [5[ 
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Table 6: Gamma-function approximation coefficients for determining the p value from the fidelity on the circle. 
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Table 7: Gamma-function approximation coefficients for determining the p value from the fidelity on the line. 




we should be wary of a few things, including the number of independent model degrees of
freedom. We should first consider the absolute p value at the maximum as defined above. If
this value is very low, we can of course reject the entire family of distributions. If the p value
is acceptable, there are two choices. We could retain the absolute p value and look at how
large a domain of acceptability exists (e.g., examine the zone over which the absolute p value
is p > 0.05 in Figs. 14 and 15). Alternatively, we could attempt to "correct" the absolute p
value for the independent parameter degrees of freedom over which we have maximized our
model, as Fisher proposed for Pearson's χ² test [66, 67]. This "correction" provides information at
the distribution family level, asking whether, for example, the data can be fit by a Gaussian
(parameters undefined). It theoretically allows rejection of otherwise suitable fits determined by
the absolute p values. However, such a correction requires further assumptions about the shape
of the fidelity or concordance landscape (sufficiently Gaussian) that are valid only asymptotically
(large n). It also prohibits comparison of the particular maximized model with other models
on the concordance landscape (these degrees of freedom are already "accounted for" in the
maximization process). Such bias correction would also lead to a different corrected p value for
the exact same model arrived at in a different manner (different p values would obtain for the
exact same model arrived at by fitting μ and σ simultaneously, or by fixing μ and fitting σ;
these corrected p values would nevertheless both be smaller than the absolute p value), which
prevents assigning an absolute concordance value to each model. While such a correction allows
comparisons at the distribution family level (e.g. "the data set is better fit with a Gaussian than
with a Lorentzian"), it prevents cross-comparison (e.g. of a particular Gaussian model with a
particular Lorentzian).

While providing a certain type of information, "correction" of the p value for the number of
degrees of freedom can actually discourage examination of other information about the model fit.
Consider a single data point x₁ fit by a Gaussian with fixed σ and unknown location parameter
μ. Maximization of the fidelity would lead to μ = x₁, with a cumulative value of c₁ = 1/2 at
x₁, corresponding to a maximal fidelity of f = 0 and a p value of 1. By a literal application of
Fisher's correction, we have zero degrees of freedom after accounting for our single data point and
our one free model parameter, leading to an indefinite p value (division by zero), and therefore
an indefinite statement on the goodness-of-fit of this class of models. It is clear in this case,
however, that the concordance landscape of absolute p values contains important information
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both about the range of models with good concordance p > 0.05 as well as those that provide 
unacceptable fits (and exactly how unacceptable). This is true for data sets containing multiple 
points as well. Consider n data points fit with a model with n linearly independent parameters. 
The corrected p value is undefined (division by 0); however, the absolute p value evaluated across 
the parameter landscape is still highly informative. 
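
The single-point case can be worked through explicitly. The sketch below is a hedged illustration: it assumes the n = 1 fidelity on the line is f = (1/2) log[4c₁(1 - c₁)], for which the null p value P(F ≤ f) with uniform c₁ has the closed form 1 - sqrt(1 - e^{2f}), equivalently 1 - |2c₁ - 1|. The observation `x1` and the scale `sigma` are hypothetical values chosen for the example.

```python
import math
import statistics


def single_point_p_value(c):
    """Null p value for one point on the line with cumulative value c."""
    if c <= 0.0 or c >= 1.0:
        return 0.0  # the model places the point on a boundary
    f = 0.5 * math.log(4.0 * c * (1.0 - c))  # assumed n = 1 fidelity
    return 1.0 - math.sqrt(max(0.0, 1.0 - math.exp(2.0 * f)))


x1, sigma = 3.0, 1.0  # hypothetical observation and fixed Gaussian width
for mu in (x1, x1 - 1.0, x1 - 1.96, x1 - 3.0):
    c = statistics.NormalDist(mu, sigma).cdf(x1)
    print(f"mu={mu:5.2f}  c1={c:.4f}  p={single_point_p_value(c):.4f}")
```

Under these assumptions the absolute p value is 1 at μ = x₁ and falls to about 0.05 when μ lies 1.96σ away from the data point, so the landscape delimits the acceptable models even though a degrees-of-freedom correction is undefined here.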

All of the preceding glosses over one important point about scientific inference. When we 
attempt to fit data with a model, often we have already looked at the data (inspection bias) 
or, if we have not looked, we have a physical model in mind (theory bias) or we are simply 
attempting to fit the data using a well-known distribution family (distribution bias). All of 
these biases are, unfortunately, unquantifiable. Therefore, when we attempt to fit our data with 
a Gaussian and then notice the wings are not fit well, so we attempt a Lorentzian, this bias 
cannot be quantified, even if we "correct" our absolute p value for the degrees of freedom of the 
model. Instead of choosing a different model with the same number of free parameters, we could 
have alternatively added an extra degree of freedom to the model (e.g. a skewness or a second 
Gaussian). Fisher's correction supposedly gives us guidance on how to determine whether the 
introduction of that extra degree of freedom actually generated an informative improvement in 
the "corrected" p value of the maximized model fits. However, we may have introduced exactly 
that degree of freedom in order to explain some feature in the data. This type of bias appears 
unavoidable. Therefore, it is likely best to retain the absolute p values and instead judiciously
apply Ockham's razor to isolate the simplest models that are sufficiently concordant with the
data (see §12).



For those who are nevertheless interested in "correcting" the p value, Cheng & Stephens give
some useful guidance for "correcting" the spacings statistic based on its asymptotic equivalence
with the likelihood function; their analysis should be directly applicable to the fidelity.

4.2 Goodness-of-fit on the Circle 

A comparison of different goodness-of-fit measures applied to multiple circular distributions is
presented in Fig. 16. For distributions on the circle, a good reference distribution is simply the
uniform distribution. For the cumulative curves shown in Figs. 16B-E, n = 10 points were
repeatedly drawn by Monte Carlo (10000 realizations) from the displayed distributions in Fig. 16A
and compared against the assumption of a uniform distribution. How well a particular distribu-
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Figure 16: Concordance values for hypothesis testing on the circle. (A) Probability distributions used to generate data 
sets with n = 10 points for testing against the assumption of a uniform distribution. (B) Cumulative distributions
of p values obtained using the indicated goodness-of-fit tests for 1000 realizations of n = 10 points drawn from a
von Mises distribution with α = 1, β = 0 (black curve in A) and tested against the assumption that the data were
drawn from a uniform distribution. In the further panels, cumulative distributions of p values similarly obtained are
shown for data drawn from (C) a Wrapped Laplace model with α = 0.5 and β = 2 (red curve in A), (D) a cosine
distribution: (1/2π)(1 + cos 2θ) (green curve in A), and (E) another cosine distribution: (1/2π)(1 + cos 10θ) (green
curve in A).



tion can distinguish these deviations from uniformity is assessed by comparing the cumulative
distribution of the p values obtained by each approach. A distribution that can discriminate well
should lie towards the upper left of these panels. Note that most of the tests have more power
than the fidelity for discrimination of a "location" parameter (Fig. 16B). However, the fidelity
has equivalent or higher power for discrimination of shape parameters (Figs. 16C-E). For the
extreme case of a highly oscillating distribution (Fig. 16E), the fidelity performs much
better, with the other statistics having essentially no discriminatory power. That the fidelity
has the power to discriminate in the manner shown in Fig. 16E is due to its consideration of the
local information of the distribution. This sensitivity to local information and thus more general
discrepancies, however, comes at a slight cost for the specific discrimination of differences in
"location" shown in Fig. 16B (see further discussion in §12).



4.3 Goodness-of-fit on the Line 

In Fig. 17, multiple methods for determining goodness-of-fit are compared for different
distributions on the line. To perform this comparison, data points were drawn from the different
model distributions shown in Fig. 17A and compared against the assumption of a specific Gaussian
model with μ = 0 and σ = 1 (black curve in Fig. 17A). In Fig. 17B, data were drawn from a
shifted Gaussian, with the fidelity exhibiting less discriminatory power than the other tests. In
Fig. 17C, data were drawn from a broader Gaussian with the same mean. In Fig. 17D, data
were drawn from a narrower Gaussian. In Fig. 17E, data were drawn from a Cauchy
distribution. In Figs. 17C-E, the fidelity provides the best discrimination. For the slight drop in
power for "location" testing (Fig. 17B), one gains much greater power in discriminating more
general, shape-dependent differences. In the opposite sense, the sensitivity of the other tests
to "location" differences comes at the cost of identifying other types of discrepancies between
the distributions. On the whole, the fidelity offers a more balanced test for general differences
between distributions (see further discussion in §12).



5 Joint Analysis 

Highly accurate joint analysis of separate data sets on the line or the circle can be performed
based on the following definitions (see Ranneby (1984) for a similar formalism for the spacings
statistic):



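$$f_n = \frac{1}{n}\sum_{i=1}^{k} n_i f_{n_i}, \tag{49}$$

$$\mu_1 = \frac{1}{n}\sum_{i=1}^{k} n_i\,\mu_{1,n_i}, \tag{50}$$

$$\mu_2 = \frac{1}{n^2}\sum_{i=1}^{k} n_i^2\,\mu_{2,n_i}, \tag{51}$$

where $n = \sum_{i=1}^{k} n_i$. The p value in this generalized case is:

$$p_n(f_n) = \int_{-\infty}^{f_n} P_n(x)\,dx \approx Q\left(\mu_1^2/\mu_2,\ \mu_1 f_n/\mu_2\right). \tag{52}$$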



This approximation works extremely well for all combinations of circular and linear data
(including combinations of circular data with linear data) with only the following three exceptions:
n₁ = 1 (line) with n₂ = 1 (line), n₁ = 1 (line) with n₂ = 2 (circle), and n₁ = 2 (circle) with
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Figure 17: (A) Data points (n = 10) were repeatedly drawn (1000 realizations) from the displayed distributions 
(red, green, blue, magenta) and tested against the assumption of a Gaussian model with μ = 0 and σ = 1 (black).
Cumulative distributions of the obtained p values are shown using the indicated goodness-of-fit tests for data points
drawn from (B) a Gaussian with μ = 1 and σ = 1 (red model in A), (C) a Gaussian with μ = 0 and σ = 2 (green
model in A), (D) a Gaussian with μ = 0 and σ = 1/2 (blue model in A), and (E) a Cauchy model with β = 0 and
α = 1/10 (magenta model in A).
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Zo„=i 



Fidelity: f m =2 R„ log 

Figure 18: Illustration of joint analysis. Three independent data sets are shown with n₁ = 2, n₂ = 5, and n₃ = 3
points. A hypothesized model is simultaneously fit to all three data sets leading to the model-derived cumulative
values, cᵢ. Here we assume that the data cannot simply be combined together into a single data set containing all of
the points due to the different circumstances under which each individual data set was taken. Each point contributes
equal weight to the final fidelity, with boundary intervals contributing the usual half weight. The fidelity is computed
according to the displayed equation (same as Eq. 49), with the p value obtained according to Eq. 52.



n₂ = 2 (circle). For these cases, a power law should be used based on the mean value obtained
above to determine the p value:

$$p_n(f_n) \approx \exp\left(-f_n/\mu_1\right). \tag{53}$$

This is similar to the need for a power-law approximation for single data sets having either n = 2
points on the line or n = 3 points on the circle. Useful examples of joint analysis can be found
in §6 in the context of generalized "t tests" for comparing two data sets and in §10 for the case
of multidimensional data.
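
As a rough illustration of joint analysis, the Python sketch below combines separate data sets by weighting each set's fidelity by its number of points and then obtains a p value by simulation. This is a sketch under assumptions, not the paper's implementation: the point-weighted average, the half-weight boundary intervals on the line, and the unit-circumference circular fidelity are assumed definitions, and the p value comes from a Monte Carlo null sample rather than from the gamma or power-law approximations. The cumulative values in `sets` are hypothetical.

```python
import bisect
import math
import random


def line_fidelity(cumulative_values):
    """Assumed fidelity on the line: boundary intervals carry half weight."""
    c = sorted(cumulative_values)
    n = len(c)
    gaps = [c[0]] + [b - a for a, b in zip(c, c[1:])] + [1.0 - c[-1]]
    total = 0.5 * math.log(2 * n * gaps[0]) + 0.5 * math.log(2 * n * gaps[-1])
    total += sum(math.log(n * g) for g in gaps[1:-1])
    return total / n


def circular_fidelity(cumulative_values):
    """Assumed fidelity on a circle of unit circumference."""
    c = sorted(cumulative_values)
    n = len(c)
    gaps = [b - a for a, b in zip(c, c[1:])] + [1.0 + c[0] - c[-1]]
    return sum(math.log(n * g) for g in gaps) / n


FIDELITY = {"line": line_fidelity, "circle": circular_fidelity}


def joint_fidelity(data_sets):
    """data_sets: list of (geometry, cumulative values); point-weighted mean."""
    n = sum(len(c) for _, c in data_sets)
    return sum(len(c) * FIDELITY[g](c) for g, c in data_sets) / n


def joint_p_value(data_sets, realizations=5_000, seed=0):
    """Monte Carlo p value of the joint fidelity under the null hypothesis."""
    rng = random.Random(seed)
    observed = joint_fidelity(data_sets)
    shapes = [(g, len(c)) for g, c in data_sets]
    null = sorted(
        joint_fidelity([(g, [rng.random() for _ in range(m)]) for g, m in shapes])
        for _ in range(realizations))
    return bisect.bisect_right(null, observed) / realizations


# Three hypothetical data sets with 2, 5, and 3 points on the line.
sets = [("line", [0.30, 0.80]),
        ("line", [0.05, 0.22, 0.41, 0.63, 0.90]),
        ("line", [0.15, 0.55, 0.70])]
print("joint fidelity:", round(joint_fidelity(sets), 4))
print("joint p value:", joint_p_value(sets))
```

Because the null sample is generated with the same mix of geometries and set sizes as the data, the same routine covers combinations of circular and linear data.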



6 Generalization of Student's t Test 

One of the most commonly used tests in statistics is Student's t test, which was developed by
William Gosset to determine the likelihood that two separate data sets (drawn from Gaussian
distributions with equal variance) have the same mean. Student's t test was subsequently
generalized in many different ways, e.g. to treat the case of unequal variances by Welch (Welch's t
test), or to treat the comparison of more than two data sets (Fisher's F test, which underlies
ANOVA). The important assumption in all of these tests is that all of the true, underlying
distributions are Gaussian. In the language of maximum fidelity, these tests are all consistently
treated in the same manner, with, importantly, no assumption of Gaussianity required. This is
possible because the fidelity and associated p value can be applied to arbitrary distributions to,
in this case, allow direct assessment of the concordance of the complete model (comprised of the
individual models for each data set) to the complete data (all data sets at once).

Consider two independent data sets with n₁ and n₂ data points. We would like to compare
the concordance of different hypothesized complete models to the complete data. We obtain
three informative p values: p₁ for the model fit to the first data set alone, p₂ for the model fit
to the second data set alone, and p for the complete model fit to both data sets together (using
the joint approach explained in §5). First, we hypothesize that the points in each data set are
drawn from the same underlying Gaussian distribution specified by μ and σ. A parameter space
search determines the solution that maximizes the fidelity. The concordance (p value) for this μ₀
and σ₀ is recorded (Fig. 19 lower right). A second hypothesis could be that both distributions
have the same σ but different means μ₁ and μ₂ (Fig. 19 upper right). A third hypothesis could
test how much improvement occurs upon assuming the same mean μ but different σ₁ and σ₂
(Fig. 19 lower left). Finally, we can test if distinct μ₁ and μ₂ and σ₁ and σ₂ provide a significant
improvement in p value (Fig. 19 upper left). These plots are only shown for the maximum value
of p (as a representative value of the full concordance landscape) with no correction for the
"degrees of freedom" implied by the assumption; see §4.1 for further discussion of this issue.

To demonstrate that maximum fidelity is not limited to merely the comparison of Gaussian
distributions, a similar example is shown in Fig. 20 based on data drawn from two different
Extreme Value distributions. It is important to note that mixed cases are also treated in the 
exact same manner: the first data set could be assumed to be drawn from a Gaussian distribution 
and the second from an Extreme Value distribution. The associated p value will always be 
calculated for the combined fit in exactly the same way and with the same meaning. 



7 Neyman paradox 



It is important to address a philosophical and mathematical argument made by Neyman,
generally referred to as "Neyman's paradox" [21, 22]. This paradox involves the supposed nonuniqueness
of the central interval for a particular quantity estimated from the data, such as the mean or
median (though this argument can be generalized to other parameters as well). Neyman argued
that the transformation y = 1/x (as shown in Fig. 21) moves points originally near the center
of the distribution to the edges of the transformed distribution (and vice versa), leading to
completely different values for the mean and median of the data, which, if used for parameter
estimation or confidence intervals, would lead to a radically different model preference.

The fidelity affords a simple resolution of Neyman's paradox. Neyman's mathematical sleight
of hand neglects one important aspect: the topology of the line. This is pictorially demonstrated
in Fig. 21. Upon the transformation y = 1/x, the Gaussian probability distribution f(x)
and its associated cumulative distribution F(x) shown in Fig. 21A are transformed to g(y) and
G(y) in Fig. 21B. The two points near the center of the distribution f(x) are moved toward the
extremities of the transformed distribution g(y). Imagine if we had only observed the single point
x₁ in Fig. 21A. The closeness of x₁ to the median value of F(x) implies a good fit according to
the fidelity; however, the transformed value y₁ is moved to the extreme left of the transformed
distribution, which appears at first glance to indicate a poor fit. This neglects the topological
change that the transformation y = 1/x generates. The boundaries in Fig. 21A at x → −∞ and
x → +∞ are moved to y = 0 in the transformed system. If the topology is preserved upon the
transformation, y = 0 does not exist as it represents the new boundary. The white circle at
y = 0 denotes the non-existent, boundary value nature of this point, with the values as y → +∞
wrapping around to y → −∞. Proper tracking of the boundaries and the cumulative intervals
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Figure 19: Maximum fidelity generalization of t tests applied to Gaussian distributions. Two data sets of m = 25 
and n₂ = 50 data points were drawn from two different Gaussian distributions having μ₁⁰ = 0 and σ₁⁰ = 1 (solid blue
curve) or μ₂⁰ = 2 and σ₂⁰ = 1.5 (solid red curve). In each panel, different assumptions are made to fit the data. In
the upper left, different μ and σ were independently fit to each data set to find the complete model that maximizes
the overall fidelity. In the upper right, different μ were assumed but the same σ to find the complete model that
maximizes the overall fidelity. In the lower left, different σ were assumed but the same μ. In the lower right, both μ
and σ were assumed to be the same for each data set. In each panel, p₁ and p₂ correspond to the concordance of the
individual model fits and p corresponds to the overall concordance of both models to the data.
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Figure 20: Maximum fidelity generalization oft tests applied to Extreme Value distributions. Two data sets of ni = 25 
and n₂ = 50 data points were drawn from two different Extreme Value distributions having β₁⁰ = 0, α₁⁰ = 1 (solid
blue curve) or β₂⁰ = 2, α₂⁰ = 2 (solid red curve). In each panel, different assumptions are made to fit the data. In
the upper left, different β and α were independently fit to each data set to find the complete model that maximizes
the overall fidelity. In the upper right, different β were assumed but the same α to find the complete model that
maximizes the overall fidelity. In the lower left, different α were assumed but the same β. In the lower right, both β
and α were assumed to be the same for each data set. In each panel, p₁ and p₂ correspond to the concordance of the
individual model fits and p corresponds to the overall concordance of both models to the data.
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Figure 21: Resolution of the Neyman paradox in the context of maximum fidelity. (A) A Gaussian distribution f(x) 
with μ = 0 and σ = 1 and associated cumulative distribution F(x) are displayed, along with two points x₁ = −1/3
and x₂ = 1/2 and their associated cumulative values F₁ and F₂ shown on the right. The boundaries at x → −∞
and x → +∞ are displayed on the cumulative interval as open circles. (B) The transformation y = 1/x leads to
the transformed probability distribution g(y) and cumulative distribution G(y). The original x₁ and x₂ points are
transformed to y₁ = −3 and y₂ = 2. The boundaries at x → ±∞ are mapped to y = 0 (open circle). Proper tracking
of the transformed boundaries preserves the value of the fidelity, defined by the cumulative intervals A₁, A₂, and A₃.



defined in the original and the transformed coordinates leads to preservation of the exact value
of the fidelity defined by the intervals A₁, A₂, and A₃ displayed on the right sides of Figs. 21A
and B. If we are not sure where the boundaries fall on the line, then we should assume no
boundaries, i.e. a circular coordinate system. In this case as well, the fidelity would be trivially
invariant to such a coordinate inversion.
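
The invariance can be verified numerically for the two points of Fig. 21 (x₁ = −1/3, x₂ = 1/2, standard Gaussian model). The Python sketch below is an illustration under assumptions: the fidelity on the line is taken to have half-weight boundary intervals, `naive_G` is the ordinary cumulative distribution of y = 1/x measured from y → −∞, and `tracked_H` is a hypothetical helper that instead accumulates probability starting from the transformed boundary at y = 0⁺, running through y → +∞, wrapping to y → −∞, and ending at y = 0⁻.

```python
import math
import statistics

F = statistics.NormalDist(0.0, 1.0).cdf  # model cumulative distribution in x


def gaps_from_cumulative(values):
    """Cumulative intervals, including the two boundary intervals."""
    c = sorted(values)
    return [c[0]] + [b - a for a, b in zip(c, c[1:])] + [1.0 - c[-1]]


def line_fidelity_from_gaps(gaps):
    """Assumed fidelity on the line: boundary intervals carry half weight."""
    n = len(gaps) - 1
    total = 0.5 * math.log(2 * n * gaps[0]) + 0.5 * math.log(2 * n * gaps[-1])
    total += sum(math.log(n * g) for g in gaps[1:-1])
    return total / n


def naive_G(y):
    """P(Y <= y) for Y = 1/X, ignoring where the boundaries have moved."""
    return F(0.0) - F(1.0 / y) if y < 0 else F(0.0) + 1.0 - F(1.0 / y)


def tracked_H(y):
    """Probability accumulated from the boundary at y = 0+ up to the point y,
    following the path 0+ -> +inf -> -inf -> 0- (x runs from +inf to -inf)."""
    return 1.0 - F(1.0 / y)


xs = [-1.0 / 3.0, 0.5]
ys = [1.0 / x for x in xs]

original = line_fidelity_from_gaps(gaps_from_cumulative([F(x) for x in xs]))
naive = line_fidelity_from_gaps(gaps_from_cumulative([naive_G(y) for y in ys]))
tracked = line_fidelity_from_gaps(gaps_from_cumulative([tracked_H(y) for y in ys]))

print("original coordinates :", round(original, 5))
print("naive transformation :", round(naive, 5))
print("boundaries tracked   :", round(tracked, 5))
```

With the boundaries tracked, the three cumulative intervals are the same as in the original coordinates (in reversed order), so the fidelity is unchanged; the naive cumulative distribution yields different intervals and a different value.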

To further illustrate why the topology should be preserved upon a change in the data
coordinate system, consider Fig. 22. In the upper panels, we have a circular coordinate system
(for simplicity) with three data points in its natural coordinate system and in a topologically
reordered system obtained by ordering the colored regions of the circle differently. With regard
to the cumulative distribution, a fit that looks poor in the original topology can be made to
look much better in a topologically reordered system. Arbitrary topological reordering can make
any model fit any data set perfectly. Such topological reordering as carried out in Fig. 22 or
topological redefinition of the boundaries (as Neyman carries out, thus making the points at
x → +∞ directly neighbor the points at x → −∞) should not be permitted in a proper theory
of inductive inference. If the topology were scrambled before the data were examined, however,
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Figure 22: Arbitrary topological reordering of a physical coordinate system could be used to make an observed data 
set fit any desired model. In the top panel, data on the circle are expressed in their original coordinate system and 
in a topologically reordered coordinate system. In the bottom panel, a particular model-based cumulative mapping 
of these data points leads to a poor fit for the original topology but a much improved fit on the reordered topology. 



the correct model parameters could still be estimated almost as efficiently as in the unaltered
coordinate system (Fig. 23). As I am not aware of any justification for such topological
reordering (or boundary reassignment) of real-world data, it seems safe to conclude that such
rearrangements should be restricted for the proper application of statistical inference. In other
words, we should at least be able to agree on the overall topology of the data coordinate system,
even if we cannot agree on the best choice of topology-preserving coordinate system. If we
cannot agree on this restriction, then we must indeed embrace Neyman's argument as an argument
against a fundamental basis for induction, leaving us with only Neyman's unsatisfactory notion
of good inductive behavior.



8 Binned Data 

The generalization of maximum fidelity to univariate binned data is straightforward but requires
more computation. Binned data hypothesized to come from a Gaussian are given in Fig. 24A.
The fidelity landscape and corresponding concordance landscape shown in Figs. 24B and C were
generated in the following way. The landscapes were sampled at 30 values along the displayed
abscissa and ordinate. At each value of μ and σ, the cumulative distribution was tabulated
at the boundaries of all of the x bins in Fig. 24A that contain points. Within each of these
bins, the points were randomly assigned cumulative values drawn from a uniform distribution
spanning the cumulative boundaries. The fidelity was then calculated. This was repeated 999



Figure 23: Parameter estimation of a model (Theta distribution with α₀ = 4 and fixed β = 0, see Table 1) carried out
on the original topology of the physical coordinate system and on the scrambled topology shown in the top panel of
Fig. 22. If the topology is scrambled before the data are observed, parameter estimation and p value goodness-of-fit
can still be carried out with only a slight reduction of efficiency for the topologically reordered system (with this
reduction likely due to the model discontinuities introduced by the topological reordering).



times, with the representative median fidelity value (the 500th ordered value) then displayed in 



Fig. |24p. The fidelity was then converted to concordance (p value) according to Eq. 42 With 
modern desktop computers, the generation of Figs. [24^ 3 and C take only a fraction of a second. 
This is what I will refer to as the "exact method" for parameter estimation and concordance 
determination for binned data. 
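The "exact method" described above can be sketched in a few lines of Python. This is only an illustrative
sketch under stated assumptions: the function names, the bin-edge/count inputs, and the specific form of the
fidelity (half-weighted end spacings, inferred from the form of Eq. 61 because Eq. 24 is not reproduced in
this section) are assumptions, and the conversion to a p value (Eq. 42) is not included.

```python
import math
import random
import statistics


def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution of a Gaussian with mean mu and width sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fidelity(c):
    """ASSUMED form of the fidelity for cumulative values c on the line.

    End spacings carry half weight; evenly spaced points give the maximum, 0.
    """
    c = sorted(c)
    n = len(c)
    total = 0.5 * math.log(2 * n * c[0]) + 0.5 * math.log(2 * n * (1.0 - c[-1]))
    total += sum(math.log(n * (c[i + 1] - c[i])) for i in range(n - 1))
    return total / n


def binned_fidelity_exact(edges, counts, mu, sigma, n_rep=999, rng=None):
    """Median fidelity over n_rep random placements of the points within their bins.

    edges has len(counts) + 1 entries; bin k spans edges[k] to edges[k + 1].
    """
    rng = rng or random.Random()
    # Cumulative boundaries (under the trial model) of every bin that contains points.
    occupied = [
        (gaussian_cdf(edges[k], mu, sigma), gaussian_cdf(edges[k + 1], mu, sigma), m)
        for k, m in enumerate(counts)
        if m > 0
    ]
    values = []
    for _rep in range(n_rep):
        # Each point gets a uniform random cumulative value inside its own bin.
        c = [rng.uniform(lo, hi) for lo, hi, m in occupied for _pt in range(m)]
        values.append(fidelity(c))
    return statistics.median(values)  # the 500th ordered value when n_rep = 999


edges = [-3, -2, -1, 0, 1, 2, 3]
counts = [1, 3, 8, 9, 4, 1]
print(binned_fidelity_exact(edges, counts, mu=0.0, sigma=1.0, rng=random.Random(0)))
```

Evaluating `binned_fidelity_exact` on a 30 by 30 grid of trial $(\mu, \sigma)$ values would give a landscape of
the kind described for Fig. 24B.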

A more economical method valid only for parameter estimation would be to maximize a
fidelity estimate obtained by calculating the cumulative values at the bin boundaries and then
redistributing the data points in a way that maximizes the fidelity within that bin. One can then
efficiently maximize this fidelity estimate without the tedious 999 Monte Carlo simulations for
each trial choice of model parameter values required above. Specifically, for a bin k containing
p data points, the data points are spread evenly over the cumulative range of the bin according
to the following formula:

$$c_{j'+i} = \left(\frac{i - 1/2}{p}\right)\left(c_k^h - c_k^l\right) + c_k^l \qquad (54)$$

where i ranges over the list of data points (from 1 to p), j' + 1 denotes the first data point in
bin k (out of the list of total data points), and $c_k^l$ and $c_k^h$ denote the lower and upper values
of the model-specific cumulative distribution over the bin k. The fidelity is then calculated as
usual (Eq. 24). This method for estimating the fidelity for binned data has already been used in
the real-world context of determining fluorophore lifetimes and lifetime populations in Walther
et al. (2011) (where it was also shown that maximum fidelity allowed for better estimation of
lifetime populations than maximum likelihood).
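A minimal sketch of this economical estimate follows, assuming that Eq. 54 places the p points of a bin at
the midpoints of p equal sub-intervals of the bin's cumulative range (so that a single point sits at the bin
midpoint). The function names and the form of the fidelity (half-weighted end spacings, inferred from the
form of Eq. 61 rather than copied from Eq. 24) are assumptions made for illustration.

```python
import math


def spread_in_bin(c_low, c_high, p):
    """Spread p points evenly over [c_low, c_high] (assumed reading of Eq. 54)."""
    return [c_low + (i - 0.5) / p * (c_high - c_low) for i in range(1, p + 1)]


def fidelity(c):
    """ASSUMED form of the fidelity for cumulative values c on the line."""
    c = sorted(c)
    n = len(c)
    total = 0.5 * math.log(2 * n * c[0]) + 0.5 * math.log(2 * n * (1.0 - c[-1]))
    total += sum(math.log(n * (c[i + 1] - c[i])) for i in range(n - 1))
    return total / n


def binned_fidelity_estimate(cum_edges, counts):
    """Fidelity estimate for binned data without any Monte Carlo sampling.

    cum_edges[k] and cum_edges[k + 1] are the model cumulative values at the
    lower and upper boundary of bin k; counts[k] is the number of points in it.
    """
    c = []
    for k, p in enumerate(counts):
        if p > 0:
            c.extend(spread_in_bin(cum_edges[k], cum_edges[k + 1], p))
    return fidelity(c)


# Two bins covering [0, 0.5] and [0.5, 1] with two points each.
print(binned_fidelity_estimate([0.0, 0.5, 1.0], [2, 2]))
```

Because the estimate is deterministic, it can be handed directly to a numerical optimizer over the model
parameters, which is what makes it cheaper than the exact method.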



9 Binary Distributions 

Figure 24: Fitting of a Gaussian model to binned data. (A) Binned data. (B) Contour map of the fidelity. (C) Contour
map of the associated p values. See text for further details.

Consider $k_0$ successes out of $n$ trials of a particular binary process. One could hypothesize an
independent binomial process (Bernoulli process) to describe the observed set of events, with
the probability of the particular outcome of $k_0$ successes given by the corresponding term in the
binomial expansion:

$$(q + (1 - q))^n = \sum_{k=0}^{n} \binom{n}{k} q^k (1 - q)^{n-k} \qquad (55)$$

Here, $q$ denotes the probability of success for each individual event. Obtaining a measure of how
adequate a particular $q$ value (or a range of $q$ values) is to account for the observed $k_0$ out of $n$
successes is a problem with a surprisingly long and convoluted history already partially explored
in §1. The cumulative distribution values for a binomial distribution with success rate $q$ defined
at the observed value of $k_0$ are:

$$c_l(n, q; k_0) = \sum_{k=0}^{k_0 - 1} \binom{n}{k} q^k (1 - q)^{n-k} \qquad (56)$$

$$c_m(n, q; k_0) = \frac{1}{2}\left[c_l(n, q; k_0) + c_h(n, q; k_0)\right] \qquad (57)$$

$$c_h(n, q; k_0) = \sum_{k=0}^{k_0} \binom{n}{k} q^k (1 - q)^{n-k} \qquad (58)$$

which have the following meanings: $c_l(n, q; k_0)$ ("low") corresponds to the cumulative value obtained
from summing the contributions from $k = 0$ to one less than the observed value, $k = k_0 - 1$;
$c_h(n, q; k_0)$ ("high") corresponds to the cumulative value obtained from summing the contributions
from $k = 0$ to $k = k_0$; and $c_m(n, q; k_0)$ ("middle") corresponds to the midpoint value of
the cumulative interval defined for the particular outcome $k_0$ ($c_m$ is the average of $c_l$ and $c_h$).
Under the assumption of a binomial process, the particular binomial distribution (parametrized
by $q$) that maximizes the fidelity is the one that centralizes the cumulative distribution at the
observed value of $k_0$. This is equivalent to maximizing the fidelity for a data set containing a
single point, in this case $k_0$ on the discretized interval from $k = 0$ to $n$. This requires solving:

$$c_m(n, q; k_0) = \frac{1}{2} \qquad (59)$$

for $q$. For example, for $n = 10$ and $k_0 = 3$, the value of $q$ that maximizes the fidelity is
$q \approx 0.306089$ (which differs from the value $q = 0.3$ obtained from maximum likelihood). Due
to the fact that $k_0$ can be located at one of the boundaries ($k = 0$ or $k = n$), it is best to
retain the cumulative value (or interval) at the point $k_0$ as an indicator of goodness-of-fit rather
than convert to a symmetrized p value (see further discussion below). The cumulative function
value(s) determined over the observed bin $k_0$ already contains all of the information we need (in
fact, it contains even more information than the p value, as the particular left-right asymmetry
of the cumulative value from its optimal value of 1/2 is preserved).

The cumulative distributions $c_m(n, q; k_0)$ (along with the lower and upper extent of the
cumulative intervals respectively given by $c_l(n, q; k_0)$ and $c_h(n, q; k_0)$) are shown as a function
of $k_0$ for $n = 10$ and for different values of the probability $q$ in Fig. 25. Consider the outcome
$k_0 = 3$. For the $q = 0.306089$ distribution (black intervals), the cumulative interval at $k_0$
is centered at 0.5, symmetrically spanning the values $c = 0.367$ to 0.633. Determination of
goodness-of-fit (measured by the cumulative interval) for other values of $q$ proceeds as follows.
For the assumption of $q = 0.1$ (red intervals), the outcome of $k_0 = 3$ would occur over the
cumulative interval spanning $c = 0.930$ to 0.987. For the assumption of $q = 0.55$ (blue intervals),
the cumulative interval at $k_0 = 3$ spans $c = 0.027$ to 0.102. Because both distributions map
the value $k_0 = 3$ to cumulative intervals at the extremities of the cumulative distribution,
they can be considered poor fits. The central 90% "confidence interval" over which the full
cumulative bin at $k_0 = 3$ is never greater than $c = 0.95$ ($c_h(10, q; 3) = 0.95$) or less than $c = 0.05$
($c_l(10, q; 3) = 0.05$) corresponds to the range $q = 0.150$ to 0.507 (the Clopper-E. Pearson "exact"
interval). One could alternatively calculate the values at which $k_0 = 3$ is centered at $c = 0.95$
and $c = 0.05$, which depends only on $c_m(10, q; 3)$. This would correspond to the slightly larger
range of $q = 0.107$ to 0.571.
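The numbers quoted for $n = 10$ and $k_0 = 3$ can be reproduced with a short script. The sketch below is
illustrative only: the function names and the use of bisection are assumptions (any one-dimensional root
finder would do), and it relies on the fact that each cumulative value decreases monotonically as $q$
increases.

```python
import math


def binom_pmf(n, q, k):
    """Probability of exactly k successes out of n at success probability q."""
    return math.comb(n, k) * q**k * (1.0 - q) ** (n - k)


def cumulative_interval(n, q, k0):
    """Return (c_l, c_m, c_h) at the observed count k0."""
    c_l = sum(binom_pmf(n, q, k) for k in range(k0))
    c_h = c_l + binom_pmf(n, q, k0)
    return c_l, 0.5 * (c_l + c_h), c_h


def solve_q(n, k0, target, which, tol=1e-12):
    """Bisection for q such that the chosen value (0: c_l, 1: c_m, 2: c_h) equals target."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cumulative_interval(n, mid, k0)[which] > target:
            lo = mid  # cumulative value still too large, so q must increase
        else:
            hi = mid
    return 0.5 * (lo + hi)


q_best = solve_q(10, 3, 0.5, which=1)          # fidelity-maximizing q (c_m = 1/2)
full_bin = (solve_q(10, 3, 0.95, which=2),     # c_h = 0.95
            solve_q(10, 3, 0.05, which=0))     # c_l = 0.05
midpoint = (solve_q(10, 3, 0.95, which=1),     # c_m = 0.95
            solve_q(10, 3, 0.05, which=1))     # c_m = 0.05
print(q_best, full_bin, midpoint)
print(cumulative_interval(10, 0.1, 3), cumulative_interval(10, 0.55, 3))
```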

Consider what happens if we observe $k_0 = 0$ out of $n = 10$ events. If $q = 0$ is assumed (the
lowest value possible), the cumulative value at $k_0 = 0$ will be $c_m(10, 0; 0) = 0.5$ and can take on
no larger value. By retaining the cumulative mapping, this $k_0 = 0$ example merely demonstrates
that at the lower boundary, only extreme cumulative values (intervals) at $k_0$ at the lower range of
the confidence interval are necessary to consider (for increasingly larger $q > 0$). This important
asymmetry of the above-constructed confidence intervals is missing in the Clopper-E. Pearson
construction, which attempts to define a central 90% confidence level no matter the value of $k_0$
and which will clearly fail at the boundaries, as pointed out by Wilson. These boundary concerns
can be neglected for continuous data, as data points that would fall directly at the boundaries of
a continuous interval would occupy a set of measure zero. Wilson's preferred solution is to consider
for an hypothesized distribution all outcomes with probabilities greater than or equal to the
probability to obtain $k_0$. This is akin to a likelihood approach. Because this is carried out on
discrete data, the likelihood has an invariant meaning, as coordinate changes, which affect the
likelihood value for continuous data, do not exist. The extension of Wilson's approach to continuous
distributions would therefore correspond to a likelihood ratio analysis, which should be rejected
for the reasons given in §1 for the likelihood fallacy. Wilson's approach for this discrete situation
is in general not as ideal or as intuitive as use of the cumulative interval, as different distributions
have different shapes, with the likelihood being a better or worse discriminating statistic depending
critically on these shape considerations. Consider the following extreme example of a flat (and
therefore non-Bernoulli) hypothesized probability distribution over a cluster of $k$ values that can
be shifted to cover $k_0$. Wilson's approach would give a similar concordance if $k_0$ lies anywhere within
the flat distribution, whereas the cumulative method outlined above would give a preference
for $k_0$ to lie at the center of the flat distribution (representing a better estimate of this shift
parameter), with an increasingly worse concordance as $k_0$ is shifted to either extremity. This
same criticism would apply to maximum likelihood on continuous distributions. The shape-dependent
discriminatory power of maximum likelihood is therefore not as general as that of the fidelity,
for which model discrimination is based on the universal nature of the cumulative distribution.

Figure 25: Binomial cumulative distributions ($n = 10$) for $q = 0.1$ (red), 0.306089 (black), and 0.55 (blue). The
cumulative interval for a given number of successes $k_0$ is defined by its midpoint value given by $c_m(10, q; 3)$ and its
symmetric lower and upper extents respectively defined by $c_l(10, q; 3)$ and $c_h(10, q; 3)$.

An example of a binary process is Peirce's consideration of the occurrence of the number 5 in
the digit sequence of $\pi$ (Peirce 7.121-122). He found that 5 occurred 33 times in the first 350 digits
of $\pi$ and 28 times in the next 350 digits, with both values within the statistical expectation
for a binomial process with $q = 0.1$ for the occurrence of 5 (success) versus the other 9 digits.
Consider the even simpler case of what the fidelity tells us about the occurrence of 1's in the
expression of $\pi$ in binary units, $\pi$ = 11.00100100001111110110101010001000 . . . Within the first
1000 digits of $\pi$, 1's appear 489 times. The cumulative value for this outcome assuming a binomial
process with $q = 0.5$ is $c_m(1000, 0.5; 489) = 0.243$. The value of $q$ that centralizes the cumulative
distribution at $k_0 = 489$ is, unsurprisingly, $q \approx 0.489$. More interestingly, the lower and upper
limits for $q$ for which the midpoint cumulative value attains either $c = 0.95$ or $c = 0.05$ based
on $c_m(1000, q; 489)$ are, respectively, $q \approx 0.463$ and $q \approx 0.515$.
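For $n = 1000$ the same midpoint cumulative value can be evaluated in log space, which avoids forming very
large binomial coefficients and very small powers separately. The sketch below is illustrative: the function
names and the choice of `math.lgamma` with bisection are assumptions, not the procedure used to obtain the
values quoted above.

```python
import math


def log_pmf(n, q, k):
    """Log of the binomial probability of k successes out of n (requires 0 < q < 1)."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(q) + (n - k) * math.log1p(-q))


def c_mid(n, q, k0):
    """Midpoint cumulative value c_m(n, q; k0)."""
    below = sum(math.exp(log_pmf(n, q, k)) for k in range(k0))
    return below + 0.5 * math.exp(log_pmf(n, q, k0))


def solve_q_mid(n, k0, target, tol=1e-10):
    """Bisection for the q at which c_m(n, q; k0) equals target (c_m decreases with q)."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if c_mid(n, mid, k0) > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(c_mid(1000, 0.5, 489))            # midpoint cumulative value at q = 0.5
print(solve_q_mid(1000, 489, 0.5))      # q that centralizes the distribution
print(solve_q_mid(1000, 489, 0.95), solve_q_mid(1000, 489, 0.05))
```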

All of the above pertains to the assumed model of a binomial process of independent events
with fixed probability $q$ for each event (Bernoulli process). However, a given observed sequence
of binary events need not be generated by such a process, as Wilson pointed out with regard
to the "Lexian ratio" (discussed in §1). Consider the following extreme example. Assume that
the experiment of $n = 10$ events is repeated multiple times, with the result always being $k_0 = 3$.
One simple way this could happen is if the $n = 10$ events are blue (success) or red (failure)
balls drawn from a bag (without replacement) that itself contains only $n = 10$ balls (3 blue,
7 red). If only one such sequence exists (and absent any knowledge about the total number of
balls in the bag), a Bernoulli process for the events should be given similar consideration to any
other model that would roughly centralize the cumulative distribution at the observed number
of successes (absent any other information about the events). The often observed deviations of
real-world data from a Bernoulli process are due to similar additional constraints on the system
that may not be clear from the outset. This again reinforces Wilson's expressed hesitation to
assign a definite probability for the Bernoulli parameter to take on any particular range of $q$
values defined by his interval.



10 Multidimensional Data 



For multidimensional data, the situation is more complex, as there is no way to uniquely define
the cumulative distribution in higher dimensional spaces (i.e. there is no natural topological
ordering of points in dimensions higher than 1D). Nevertheless, in the following, I will present a
method based on "inverse Monte Carlo" that allows extension of the fidelity to 2D (and higher)
distributions for both parameter estimation and concordance determination. Significantly, the
fidelity and associated concordance values obtained below are coordinate independent. It should
be noted that Ranneby and colleagues have proposed a method for extending maximum spacings
to higher dimensions based on tessellations of the coordinate space representation of the data,
which unfortunately makes their approach dependent on the exact choice of coordinate system
(coordinate fallacy).

The approach proposed below views inductive inference in higher dimensional spaces as
a form of "inverse Monte Carlo," which is already the implicit basis of the fidelity in 1D.
The mapping in Fig. 1 from the data coordinate space to the model-based cumulative interval
is simply an inversion of the normal process of Monte Carlo, for which points are randomly
chosen on the unit interval (y-axis) and then mapped onto the coordinate space (x-axis) via the
cumulative distribution. The fidelity can be thought of as a seemingly optimal measure of how
random the inverse mapping shown in Fig. 1 appears on the unit interval for each hypothesized
model distribution, fulfilling the task that Pearson described in 1933.

Now consider the generalization to 2D. Specifically, assume that one wishes to simulate a 
2D Gaussian distribution of data points. This problem can be approached in different ways. 
One could discretize the space, order the bins, and then assign each event to one of the ordered 
bins, which are individually weighted according to the probability distribution. However, due to 
the discretization, this is merely an approximate simulation. A better way to simulate the data 
without sacrificing resolution is as follows. Events can be simulated separately in the x and y 
directions over the unit square. These events can then be stretched and rotated to conform to 
the desired 2D Gaussian distribution. Another method could be to generate random events in 
$r$ and $\theta$ over the unit sphere (circular disc). These events can then be stretched and rotated
to conform to the 2D Gaussian distribution. The methods for inductive inference demonstrated 
below draw on these symmetries and can be considered a form of 2D "inverse Monte Carlo", 
with the task of finding the distribution that best concords with the observed data points in 
both dimensions. 

[Figure 26 panels, in the Monte Carlo direction, are labeled "cumulative", "circularly", "scale", "rotate", and
"translate"; the reverse direction is labeled "Inverse Monte Carlo".]

Figure 26: Monte Carlo and inverse Monte Carlo for a 2D Gaussian. Consider first the Monte Carlo simulation of
the data. First, cumulative values $c_r$ and $c_\theta$ are chosen from the unit interval (also shown within the unit circle,
with $c_\theta$ ranging from 0 to 1 corresponding to a full $2\pi$ rotation). Second, these data points are converted to $r$ and
$\theta$ values for a 2D Gaussian by Eq. 60. Third, coordinates for the data points are stretched along the major/minor axes
of the Gaussian, which is aligned along the x and y axes. Fourth, rotation by $\pi/3$ is carried out. Fifth, the points
are translated to their final coordinates. Inverse Monte Carlo corresponds to mapping the observed data points to
the intervals $c_r$ and $c_\theta$ by a reversal of the above transformations. This enables calculation of the fidelity and the
corresponding p value for an hypothesized 2D Gaussian model.

The essential notion of 2D inverse Monte Carlo is displayed in Fig. 26. In this figure, the
Monte Carlo-based simulation of data points drawn from a 2D Gaussian distribution is shown.
Due to the symmetry of the distribution function, the following approach can be taken. First,
two random numbers between 0 and 1 are drawn from a uniform distribution on the unit interval
and designated $c_r$ and $c_\theta$. These points can be displayed on two separate unit intervals or within
the unit circle (with $c_\theta$ going from 0 to 1 representing a $2\pi$ rotation). For each point, we can
convert $c_r$ and $c_\theta$ to the coordinates $r$ and $\theta$ of a circularly symmetric Gaussian (second plot)
according to the following equations:

$$r = \sqrt{-2\log(1 - c_r)}, \qquad \theta = 2\pi c_\theta. \qquad (60)$$

Scaling of this distribution along the major/minor axes (aligned in the x and y directions) is
then carried out (third plot), followed by rotation (fourth plot), and translation to reach the
desired physical coordinates of the system under consideration (fifth plot). All steps in this
process are of course reversible, which permits the inverse process of mapping the observed data
points to the unique values $c_r$ and $c_\theta$ for a particular hypothesized Gaussian. With these values
in hand, the fidelity and associated p values can be calculated immediately using joint analysis
(see §5) of the n coordinates $c_r$ on the line and the n coordinates $c_\theta$ on the circle. I will refer to
this overall approach as the $r$-$\theta$ transform.
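The forward (Monte Carlo) chain of Eq. 60 and its reversal can be sketched as follows. The parameter names
(`x0`, `y0` for the center, `a`, `b` for the $1\sigma$ extents along the major/minor axes, `phi` for a
counterclockwise rotation angle) and the function names are assumptions chosen for illustration; the joint
fidelity analysis of the resulting $c_r$ and $c_\theta$ values is not included.

```python
import math


def forward(c_r, c_theta, x0, y0, a, b, phi):
    """Monte Carlo direction: cumulative values -> physical coordinates."""
    r = math.sqrt(-2.0 * math.log(1.0 - c_r))      # Eq. 60
    theta = 2.0 * math.pi * c_theta                # Eq. 60
    u = a * r * math.cos(theta)                    # scale along the major axis
    v = b * r * math.sin(theta)                    # scale along the minor axis
    x = x0 + u * math.cos(phi) - v * math.sin(phi)  # rotate, then translate
    y = y0 + u * math.sin(phi) + v * math.cos(phi)
    return x, y


def inverse(x, y, x0, y0, a, b, phi):
    """Inverse Monte Carlo: physical coordinates -> (c_r, c_theta)."""
    dx, dy = x - x0, y - y0                        # undo translation
    u = dx * math.cos(phi) + dy * math.sin(phi)    # undo rotation
    v = -dx * math.sin(phi) + dy * math.cos(phi)
    s, t = u / a, v / b                            # undo scaling
    c_r = 1.0 - math.exp(-0.5 * (s * s + t * t))   # inverts r = sqrt(-2 log(1 - c_r))
    c_theta = (math.atan2(t, s) / (2.0 * math.pi)) % 1.0
    return c_r, c_theta


params = dict(x0=7.0, y0=3.0, a=3.0, b=2.0, phi=math.pi / 3)
x, y = forward(0.3, 0.8, **params)
print(inverse(x, y, **params))   # recovers (0.3, 0.8) up to rounding
```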

This is not the only possible inverse Monte Carlo approach. We could have also started from
the symmetric Gaussian displayed in the second panel in Fig. 26 to separately Monte Carlo
the x and y coordinates of the data. Due to the symmetry of the Gaussian, the x and y values
can be independently generated, then stretched, rotated, and translated to the final physical
coordinate system. This can also, of course, be reversed to yield a second inverse Monte Carlo
approach based on the $c_x$ and $c_y$ values on the line. I will refer to this as the model-based x-y
transform.

A final approach for generating cumulative values, which does not involve inverse Monte
Carlo, is to simply project the observed data onto the x and y axes and then compute the
separate cumulative projections of the hypothesized model on each axis. The observed data
points can then be mapped by the projected cumulative distributions to determine their $c_x$ and
$c_y$ values (which will differ from the x-y transform). I will refer to this as the coordinate-based
x-y transform below.
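The difference between the two x-y transforms can be made concrete with a small sketch. It assumes that
`a` and `b` are the $1\sigma$ extents of the hypothesized Gaussian along its major/minor axes and that `phi`
is a counterclockwise rotation; these parameter names and the function names are illustrative assumptions.
For the coordinate-based version, the projected (marginal) widths follow from the rotated covariance matrix.

```python
import math


def normal_cdf(z):
    """Cumulative distribution of the standard Gaussian."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def model_based_xy(x, y, x0, y0, a, b, phi):
    """Undo translation, rotation, and scaling, then map each axis separately."""
    dx, dy = x - x0, y - y0
    u = dx * math.cos(phi) + dy * math.sin(phi)
    v = -dx * math.sin(phi) + dy * math.cos(phi)
    return normal_cdf(u / a), normal_cdf(v / b)


def coordinate_based_xy(x, y, x0, y0, a, b, phi):
    """Map each point with the model's cumulative projections on the physical axes."""
    sx = math.sqrt((a * math.cos(phi)) ** 2 + (b * math.sin(phi)) ** 2)
    sy = math.sqrt((a * math.sin(phi)) ** 2 + (b * math.cos(phi)) ** 2)
    return normal_cdf((x - x0) / sx), normal_cdf((y - y0) / sy)


params = dict(x0=7.0, y0=3.0, a=3.0, b=2.0, phi=math.pi / 3)
print(model_based_xy(9.0, 5.0, **params))
print(coordinate_based_xy(9.0, 5.0, **params))
```

The two mappings coincide when `phi` is zero and differ once the model is rotated, which is the sense in
which the $c_x$ and $c_y$ values "will differ" between the two transforms.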

Figure 27: 2D Gaussian fitting of data drawn from a 2D Gaussian. p values are shown for the $r$-$\theta$ transform (bold
font), the model-based x-y transform (regular font), and the coordinate-based x-y transform (italic font) for each
of the displayed trial models: the true model in black ($x_0 = 7$, $y_0 = 3$, $a_0 = 3$, $b_0 = 2$, $\phi_0 = \pi/3$), the magenta
model ($a_1 = 0.8a_0$), the red model ($\phi_2 = \phi_0 + \pi/10$), and the green model ($x_3 = x_0 + 3$, $y_3 = y_0 + 3$, $a_3 = 1.5a_0$,
$\phi_3 = \phi_0 - \pi/15$).

In Fig. 27, different 2D Gaussian models are used to fit data drawn from a 2D Gaussian
(black contour at $1\sigma$) with three different transform methods used for calculating the
fidelity/concordance: the recommended $r$-$\theta$ transform (bold font), the model-based x-y transform
(normal font), and the coordinate-based x-y transform (italic font). The $r$-$\theta$ transform is
recommended over the other two for the following reason. One can easily consider a model that
is symmetric with respect to the original physical coordinate systems (indicating no correlation),
but that would project onto the physical x and y axes (coordinate-based x-y transform) in a
way that would mimic the projection of the Gaussian distribution shown in the last panel of
Fig. 26. This weakness would affect the model-based x-y transform in a similar fashion as well.
Models that would project onto the $r$ direction and the $\theta$ direction in a way that mimicked a
rotated Gaussian are more baroque in their construction (e.g. a spiral pattern or a pattern with
holes at various $r$ and $\theta$ that are filled in by displaced densities at different $r$ but the same $\theta$ to
give a similar projection of the density across $\theta$), though these possibilities should nevertheless
be kept in mind.

Unlike the tracking of the fidelity in 1D upon acquisition of more and more data, taking more
data in higher dimensions is therefore not by itself guaranteed to conform to "a proceeding
which must in the long run approximate to the truth" (Peirce 2.780). This overall approach,
however, extracts more information about the model fit than is available with any other approach,
in particular in the form of an absolute, coordinate-independent concordance measure that
can be used not only to find the optimal central point of the distribution and correlation (similar
to principal component analysis) but also to determine if a multidimensional Gaussian is even an
appropriate model. For example, different 2D Gaussian models were used to fit data drawn from
a 2D Exponential distribution in Fig. 28. All of these models provide inadequate fits based on
the $r$-$\theta$ transform method for determining the fidelity/concordance. While all the hypothesized
models used in Figs. 27 and 28 were 2D Gaussians, this approach can be applied to any class
of 2D models. However, symmetric models (like the 2D Gaussian or the 2D Exponential) are
mathematically more tractable as the conventions for mathematical transformation of these
symmetric distributions take on an obvious and (for the 2D case) unique form.

Figure 28: 2D Gaussian fitting to data drawn from a 2D Exponential (n = 500) using the $r$-$\theta$ transform approach.
The 2D Exponential model parameters are $x_0 = 7$, $y_0 = 3$, $a = 5$, $b = 1$, $\phi = \pi/3$ (where $a$ and $b$ give the distance to
the 1/e surface of the 2D Exponential). The p values for the various 2D Gaussians are listed in the plot. These three
models have the same major-minor axis ratio and alignment as the underlying 2D Exponential and differ from each
other only by an overall scale factor s (ratio of the distance to the $1\sigma$ surface of the 2D Gaussian to the distance to the
1/e surface of the 2D Exponential along the major axis): cyan (s = 1.5), red (s = 1.81), green (s = 2). Maximization
of the concordance (p value) with respect to this scale factor leads to the red model with s = 1.81 and p = 0.0154
(the p value of the correct exponential model is p = 0.328). None of the models is therefore sufficiently concordant
with the data (in the sense of p > 0.05).

While the $r$-$\theta$ transform (based on the symmetry of the Gaussian) is uniquely defined for
2D Gaussians, for 3D Gaussians (or other similarly symmetric distributions), a coordinate convention
must be chosen to transform the observed data points using a spherically symmetric
coordinate system ($r$, $\theta$, $\phi$). The obvious choice is to align the longest axis of the hypothesized
Gaussian along the azimuthal axis = 0. Inverse Monte Carlo would then map the observed
points to cumulative values for the $r$ direction ($c_r$ values on the line), the $\theta$ direction ($c_\theta$ values
on the circle), and the $\phi$ direction ($c_\phi$ values on the line). Computation of the fidelity and
associated p value is then straightforward. For even higher dimensional Gaussians, corresponding
analogues of the Euler angles can be used with more choices of the coordinate orientation
convention required. Any arbitrary D-dimensional distribution can be converted to the hypersphere
(as in Fig. 26) upon a complicated enough transformation function; however, models that
can easily be converted to n-spheres certainly simplify the process of inverse Monte Carlo and
they importantly allow for universal conventions to be established (restricting the ability of the
researcher to fool oneself or to fool others through statistical manipulation). As symmetric models
like multidimensional Gaussians are typically invoked to explain higher dimensional data,
the fidelity presents itself as a uniquely powerful tool for coordinate-independent assessment of
model concordance in higher dimensional spaces.



11 Nonparametric Comparisons 



Whether two observed data sets are drawn from the same underlying distribution is an important
problem in statistics. A fidelity-based method for comparing two empirical distributions, derived
from two observed data sets with $n_1$ and $n_2$ total events, is presented in Fig. 29. This test is
compared with other standard two-sample tests, specifically, Student's t test, the
Wilcoxon-Mann-Whitney test [72, 73], and the 2-sample Kolmogorov-Smirnov test [59, 60].

Construction of the nonparametric or model-independent form of the fidelity statistic is
shown in Fig. 29. In Fig. 29A, eight data points are shown drawn from a Gaussian distribution
with $\mu = 1.5$ and $\sigma = 1$ (blue) and five data points are shown drawn from a second Gaussian
distribution with $\mu = 0$ and $\sigma = 1$ (red). To construct the relevant statistic, we assume the
second distribution is known (defined by its fidelity-maximizing spacing across the cumulative
interval), which then permits calculation of the fidelity of the first distribution relative to this
reference (Fig. 29B). Each data point from the first data set is placed in the bins defined by the
second distribution in a way that maximizes the fidelity within each bin (as in Eq. 54). Upon
switching the roles of each data set, we can determine a fidelity in the other direction as well
(Fig. 29C). We combine these measures into a single fidelity statistic by averaging the individual
fidelities:

$$f_1 = \frac{1}{2n_1}\log\left[2n_1 c^{(1)}_1\right] + \frac{1}{2n_1}\log\left[2n_1\left(1 - c^{(1)}_{n_1}\right)\right] + \frac{1}{n_1}\sum_{i=1}^{n_1-1}\log\left[n_1\left(c^{(1)}_{i+1} - c^{(1)}_i\right)\right]$$

$$f_2 = \frac{1}{2n_2}\log\left[2n_2 c^{(2)}_1\right] + \frac{1}{2n_2}\log\left[2n_2\left(1 - c^{(2)}_{n_2}\right)\right] + \frac{1}{n_2}\sum_{i=1}^{n_2-1}\log\left[n_2\left(c^{(2)}_{i+1} - c^{(2)}_i\right)\right]$$

$$f = (f_1 + f_2)/2. \qquad (61)$$

Here $c^{(1)}_i$ denotes the ordered cumulative values assigned to the first data set relative to the second, and
$c^{(2)}_i$ those assigned to the second data set relative to the first.



In the last line, the final combined fidelity is simply the straight average of $f_1$ and $f_2$. Other
conventions for combining $f_1$ and $f_2$ could be made, such as $f' = n_1 f_1 + n_2 f_2$, but the average
of $f_1$ and $f_2$ appears to be the most reasonable, as it balances the contributions from each
direction. This can be seen by examining $f' = n_1 f_1 + n_2 f_2$. In $f'$, each data point contributes a
comparable amount to the overall fidelity as $f_1$ and $f_2$ are averages that asymptotically approach
Euler's constant $\gamma$ (under the null hypothesis). However, $f_1$ and $f_2$ were obtained by fixing the
reference distribution, constructed from the opposite data set, to an idealized approximation
(equally spaced data points on the cumulative interval in Figs. 29B and C), the reliability
of which will depend on the number of data points in the opposite data set. The simplest
and likely most fundamental way to incorporate this information-based reliability index is by
multiplying each term by the number of data points from the opposite data set to obtain
$f'' = n_2 n_1 f_1 + n_1 n_2 f_2$, but this is just $2 n_1 n_2$ times $f$ defined above in Eq. 61.
Figure 29: Fidelity-based approach for testing whether two data sets were drawn from the same underlying model. (A)
A Gaussian with $\mu_1 = 1.5$ and $\sigma_1 = 1$ (blue) and a second Gaussian with $\mu_2 = 0$ and $\sigma_2 = 1$ (red) are displayed with
a representative set of data points drawn from each distribution ($n_1 = 8$ and $n_2 = 5$). To calculate a representative
value for the fidelity, we consider the fidelity in both directions. (B) First, the fidelity of the blue data points is
calculated with respect to the assumption of a distribution constructed from the red points. To do this, we distribute
the red data points evenly across the cumulative interval (x-axis). We then place each blue data point at the midpoint
of the particular interval in which it is located. If two or more blue points fall in the same interval, they are spread
evenly over the interval in a way that maximizes their fidelity within the interval (as in Eq. 54). The fidelity, $f_1$,
can then be calculated according to Eq. 61. (C) Here, the blue points are evenly distributed across the cumulative
interval (x-axis) with the red points interspersed to enable calculation of $f_2$ in Eq. 61. The final symmetrized fidelity
is $f = (f_1 + f_2)/2$ (Eq. 61).
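One plausible reading of the construction in Fig. 29 and Eq. 61 is sketched below. Several details are
assumptions rather than statements of the method: the reference points are placed at $(j - 1/2)/n$ on the
cumulative interval, the two end intervals are treated like any other bin, ties between the two data sets are
assigned to the lower bin, and the fidelity takes the half-weighted end-spacing form. All function names are
hypothetical.

```python
import math
from bisect import bisect_left


def fidelity(c):
    """ASSUMED form of the fidelity for cumulative values c on the line."""
    c = sorted(c)
    n = len(c)
    total = 0.5 * math.log(2 * n * c[0]) + 0.5 * math.log(2 * n * (1.0 - c[-1]))
    total += sum(math.log(n * (c[i + 1] - c[i])) for i in range(n - 1))
    return total / n


def directional_fidelity(points, reference):
    """Fidelity of `points` on the cumulative interval defined by `reference`."""
    ref = sorted(reference)
    m = len(ref)
    # Reference points spread evenly across [0, 1] define m + 1 intervals.
    edges = [0.0] + [(j + 0.5) / m for j in range(m)] + [1.0]
    counts = [0] * (m + 1)
    for x in points:
        counts[bisect_left(ref, x)] += 1   # interval in which the point falls
    c = []
    for k, p in enumerate(counts):
        lo, hi = edges[k], edges[k + 1]
        # Spread the p points of this interval evenly (as in Eq. 54).
        c.extend(lo + (i - 0.5) / p * (hi - lo) for i in range(1, p + 1))
    return fidelity(c)


def two_sample_fidelity(a, b):
    """Symmetrized fidelity: the straight average of both directions."""
    return 0.5 * (directional_fidelity(a, b) + directional_fidelity(b, a))


print(two_sample_fidelity([1, 3, 5], [2, 4, 6]))       # interleaved samples
print(two_sample_fidelity([1, 2, 3], [10, 11, 12]))    # well-separated samples
```

Converting the resulting value to a p value would additionally require the null distribution of the statistic
for the given sample sizes, for example from Monte Carlo realizations as described for Fig. 30.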



In Fig. 30, the results of applying this empirical test based on the fidelity to Gaussians that
differ only in location (left panels) and Gaussians that differ only in width (right panels) are shown
in comparison with other tests: Student's t test (red curve), the Wilcoxon-Mann-Whitney test
(blue curve), and the two-sample Kolmogorov-Smirnov test (cyan curve). For "location" testing,
both the fidelity and the two-sample Kolmogorov-Smirnov test have slightly less power than the
other two tests. However, the fidelity is clearly superior in discriminating differences in width
compared to these other tests (in fact, Student's t test and the Wilcoxon-Mann-Whitney test
have no power to discriminate width differences due to their mathematical definitions). That the
fidelity represents a general test with discriminatory power for arbitrary distributions is shown
in Fig. 31 for discrimination of an Extreme Value distribution from a Cauchy distribution. The
fidelity is clearly superior to the other three tests.

While showing a slight decrease in sensitivity for "location" testing (as compared to the
exclusive "location" tests corresponding to Student's t test and the Wilcoxon-Mann-Whitney
test), the fidelity more than makes up for this by its ability to test more general differences
between distributions (different widths or shapes). Perhaps of most significance, it equals or
surpasses the two-sample Kolmogorov-Smirnov test, which represents the "gold standard" for
general nonparametric testing. The fidelity also does not suffer from the asymmetric discrimination
problems of the Kolmogorov-Smirnov test (better discrimination for differences near
the median than at the boundaries), as it treats all discrepancies between the distributions
equally, no matter their location on the cumulative intervals (Figs. 29B and C). Similar results
and behavior are expected for nonparametric testing on the circle as well (upon the obvious
modification of Fig. 29 at the boundaries).
Figure 30: Comparison of nonparametric tests for discriminating Gaussians. (A) The null hypothesis: $n_1 = 8$ data
points and $n_2 = 5$ data points were repeatedly drawn (20000 realizations) from the uniform distribution on the
unit interval to compute the cumulative fidelity distribution (black curve). The Gaussian "location" test: $n_1 = 8$
data points from a Gaussian with $\mu_1 = 0$ and $\sigma_1 = 1$ and $n_2 = 5$ data points from a Gaussian with $\mu_2 = 1.5$ and
$\sigma_2 = 1$ were repeatedly drawn (1000 realizations) to generate the cumulative distribution of the fidelity (magenta
curve). (B) Through use of the null distribution (black curve in A), the fidelity distribution (magenta curve in A)
was converted to the displayed cumulative distribution of p values (black curve). Similar conversions were used to
compute the two-sample Kolmogorov-Smirnov test (cyan) on the same data. Routines in Mathematica® allowed
immediate computation of the p values for Student's t test (red) and the Wilcoxon-Mann-Whitney test (blue) on
the same data. The dashed green line indicates the null distribution. (C) The null hypothesis: same as in panel A
for $n_1 = 8$ and $n_2 = 5$ data points (black curve). The Gaussian width test: $n_1 = 8$ data points were repeatedly
drawn (1000 realizations) from a Gaussian with $\mu_1 = 0$ and $\sigma_1 = 1$, and $n_2 = 5$ data points were repeatedly drawn
from a Gaussian with $\mu_2 = 0$ and $\sigma_2 = 5$ (magenta curve). (D) Through use of the null distribution (black curve
in C), the fidelity distribution (magenta curve in C) was converted to the displayed cumulative distribution of p
values (black curve). Similar conversions were used to compute the two-sample Kolmogorov-Smirnov test (cyan) on
the same data. Routines in Mathematica® allowed immediate computation of the p values for Student's t test (red)
and the Wilcoxon-Mann-Whitney test (blue) on the same data. The dashed green line indicates the null distribution.
The raggedness of the displayed distributions is not due to a calculational approximation (calculational uncertainty
is typically much smaller than these features), but is a fundamental aspect of the tests (i.e. the discrete nature of the
computed statistics).
Figure 31: Comparison of non-parametric tests for discriminating more general distributions. (A) n_1 = 20 data points 
were drawn from an Extreme Value distribution with β_1 = and α_1 = 1 (black), and n_2 = 10 data points were 
drawn from a Cauchy distribution with β_2 = and α_2 = 1.5 (red) (see Table [jjl). (B) The null distribution of the 
fidelity for n_1 = 20 and n_2 = 10 (black, constructed from 20000 Monte Carlo realizations) is shown along with the 
cumulative distribution of the fidelity obtained for the particular pair of distributions (magenta, 1000 realizations). 
(C) Through use of the null distribution (black curve in B), the fidelity distribution (magenta curve in B) was 
converted to the displayed cumulative distribution of p values (black curve). Also shown are the cumulative p value 
distributions obtained from Student's t test (red), the Wilcoxon-Mann-Whitney test (blue), and the two-sample 
Kolmogorov-Smirnov test (cyan) on the same data (see Fig. 30 for more details). The dashed green line indicates the 
null distribution. 
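
The conversion from a statistic to a p value through a simulated null distribution, as used in Figs. 30 and 31, can be sketched as follows. This is a minimal illustration and not the author's Mathematica code: the names `empirical_p_value`, `toy_statistic`, and `draw_statistic` are hypothetical, the statistic is a simple stand-in rather than the fidelity, and the use of the lower tail (a smaller statistic meaning poorer concordance) is an assumption about the sign convention.

```python
import random
from bisect import bisect_right

def empirical_p_value(sorted_null, observed):
    """Fraction of null statistics at or below `observed` (lower-tail p value).

    Assumption: a smaller statistic means poorer concordance.
    `sorted_null` must already be sorted in ascending order.
    """
    return bisect_right(sorted_null, observed) / len(sorted_null)

def toy_statistic(a, b):
    """Stand-in two-sample statistic (NOT the fidelity): minus the absolute
    difference of the sample means, so that smaller means more discrepant."""
    return -abs(sum(a) / len(a) - sum(b) / len(b))

def draw_statistic(rng, sigma_2):
    """One realization: 8 points from N(0, 1) and 5 points from N(0, sigma_2)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(8)]
    b = [rng.gauss(0.0, sigma_2) for _ in range(5)]
    return toy_statistic(a, b)

rng = random.Random(0)

# Null distribution from 20000 Monte Carlo realizations (both samples N(0, 1)).
null = sorted(draw_statistic(rng, 1.0) for _ in range(20000))

# Convert 1000 fresh null realizations to p values; these should be close
# to uniform, which corresponds to the dashed diagonal in the figures.
p_values = [empirical_p_value(null, draw_statistic(rng, 1.0)) for _ in range(1000)]
fraction_below_005 = sum(p <= 0.05 for p in p_values) / len(p_values)
print(fraction_below_005)
```

Replacing the second call to `draw_statistic(rng, 1.0)` with a different `sigma_2` gives the p value distribution under an alternative, which is the kind of curve compared across tests in panels B and D.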



12 Discussion 



"Thus it is that inquiry of every type, fully carried out, has the vital power of self- 
correction and of growth. This is a property so deeply saturating its inmost nature 
that it may truly be said that there is but one thing needful for learning the truth, 
and that is a hearty and active desire to learn what is true. If you really want to 
learn the truth, you will, by however devious a path, be surely led into the way of 
truth, at last. No matter how erroneous your ideas of the method may be at first, 
you will be forced at length to correct them so long as your activity is moved by that 
sincere desire. Nay, no matter if you only half desire it, at first, that desire would 
at length conquer all others, could experience continue long enough. But the more 
veraciously truth is described at the outset, the shorter by centuries will the road to 
it be." (Peirce 5.582) 

Conventional statistical approaches by "however devious a path" will generally allow one to 
arrive at the "truth", as long as one keeps in mind the particular limitations of the assumed 
approaches. However, as I hope to have demonstrated in this manuscript, maximum fidelity 
represents a particularly "veracious" method for identifying the model (or set of models) that 
best concord with the data, serving as a highly efficient and seemingly optimal basis for statistical 
inference. 

I have argued that maximum fidelity is superior to all other methods, including maximum 
likelihood, for parameter estimation (see §3). The likelihood was previously considered 
fundamental by Fisher and by many others throughout the twentieth century; however, its 
spectacular failure for parameter estimation within certain distribution families tarnished its 
status (see also the references in §1). The superiority of parameter estimation by maximization 
of the fidelity across all of the diverse cases tested in §3, as well as the fact that it never fails 
no matter the distribution family (due to its boundedness from above), together demonstrate 
that the fidelity is more fundamental than the likelihood. In §3.4 I have shown that the 
likelihood can be derived as an asymptotic approximation to the fidelity on the circle and on 
the line. Whatever good qualities the likelihood possesses for the analysis of univariate data 
can therefore be attributed to its asymptotic approximation of the fidelity. 
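
One well-known way in which the likelihood can fail is that it is unbounded for a two-component Gaussian mixture: pinning one component to a data point and shrinking its width drives the log-likelihood upward without limit. The sketch below is offered only as an illustration of that general phenomenon and is not necessarily one of the cases referred to above; the data values and the names `gauss_pdf` and `mixture_loglik` are hypothetical.

```python
import math

def gauss_pdf(x, mu, sigma):
    """Gaussian probability density with mean mu and width sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_loglik(data, narrow_sigma):
    """Log-likelihood of an equal-weight mixture of two Gaussians.

    One component is pinned to the first data point with width
    `narrow_sigma`; the other is a fixed broad Gaussian (mean 1.5, width 1).
    """
    spike_mu = data[0]
    total = 0.0
    for x in data:
        density = (0.5 * gauss_pdf(x, spike_mu, narrow_sigma)
                   + 0.5 * gauss_pdf(x, 1.5, 1.0))
        total += math.log(density)
    return total

data = [0.0, 1.0, 2.0, 3.5]      # hypothetical sample
widths = (1e-1, 1e-3, 1e-6)      # progressively narrower spike
logliks = [mixture_loglik(data, s) for s in widths]
for s, ll in zip(widths, logliks):
    print(f"narrow_sigma = {s:g}   log-likelihood = {ll:.3f}")
```

The log-likelihood keeps growing as the spike narrows, so no maximum exists in this family. A statistic that is bounded from above cannot diverge in this way.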

The fidelity is also superior to the spacings statistic for parameter estimation. Ranneby 
claimed that the spacings statistic represented a better approximation than the likelihood for 
the Kullback-Leibler divergence; however, he wisely equivocated on the use of the maximum 
spacings estimate (MSP) versus the maximum likelihood estimate (MLE) for small sample 
parameter estimation: 

"To give rules for choosing between the MSP estimate and the MLE when they are 
asymptotically equivalent we have to know more, especially for small samples." 

Small-sample comparisons of the estimates obtained from maximum spacings and from 
maximum likelihood have not successfully demonstrated the superiority of either statistic. In light 
of this and, more significantly, of the superiority of the fidelity for general parameter 
estimation (§), it is clear that the spacings statistic should also be regarded as an approximation of 
the more fundamental fidelity statistic, which provides the best representation of the fidelity 
(Kullback-Leibler divergence) of the model with the observed data. As argued in §2, the 
success of the fidelity statistic over the spacings statistic can be attributed to the former's better 
respect for the density/symmetry of the contribution of each data point (and not the spacings 
they create) across the cumulative interval. 

Complete avoidance of the probability fallacy and the parameter fallacy prohibits the 
construction of a quantitative universal measure for assessing the "optimality" of a parameter 
estimation method that would be valid across all distribution families. What the fidelity 
accomplishes is the determination of the model distribution (often from a restricted set of 
distributions) that best summarizes the local information represented by the data. That the 
fidelity's consideration of only local information turns out to be "optimal" for parameter 
estimation (in the loose sense of the term as used in §3 to refer to low median bias and a 
narrow distribution about the true value) is a highly interesting byproduct. What we are driven 
to conclude is that we should have always been on the search for the distribution that maximizes 
the fidelity. The fidelity is the quantity on which it makes the most sense to base statistical 
inference. 

The best justifications for the fidelity, therefore, lie in its empirically determined estima- 
tion optimality, its reliable coordinate-independent basis on the cumulative distribution (unlike 
maximum likelihood, which cannot be compared across different distribution families and even 
fails for parameter estimation within certain distribution families), and its intuitive and balanced 
consideration of the local density contributed by each data point across the cumulative interval 
(in contrast to maximum spacings, see Fig. 5). 

I have also argued that the concordance value associated with the fidelity is the most gen- 
erally optimal approach for discriminating incorrect models for data on the circle and on the 
line (see §). Due to its basis on the cumulative distribution, both the fidelity and its associated 
concordance value have an absolute meaning that is invariant to the choice of coordinate system 
for the data (with the only restriction being to coordinate systems that share the same 
topological ordering of the data points, see §). As the fidelity groups together distributions based 
on their "frequency of occurrence" (according to the data "density" assumption, see §), it 
appears to be a fundamental measure of the directional distance of an hypothesized distribution 
from the data-inferred estimate of the underlying distribution. It is important to note that the 
fidelity is only based on the logarithmic sum of the distribution of the sizes of the cumulative 
intervals, not on their particular order. The fidelity tests only zeroth order information and is 
therefore completely insensitive to the location of multiple shorter-than-average intervals across 
the cumulative interval, which might be dispersed randomly over the cumulative interval, might 
lie directly next to each other (as could be revealed by a more sensitive but less general "lo- 
cation" or "clustering" test), or might be distributed in a regular, evenly-spaced fashion, etc. 
This explains the slight reduction in sensitivity of the fidelity upon the (artificial) restriction 
of testing to the "location" of a model distribution of fixed shape (see Fig. 17). However, the 
fidelity exhibits much greater power for discriminating more general differences between the dis- 
tributions. The fidelity might assign equal discrepancies to a "location" difference as to a local 
concentration difference, with a clear meaning of the degree of this discrepancy in both cases 
from the definition of the fidelity of the distribution in terms of its "frequency of occurrence" 
(see §). The extension of maximum fidelity to the model-independent assessment of whether 
two data sets were drawn from the same underlying distribution in §11 demonstrates that the 
so-defined fidelity possesses a similar sensitivity to more general distribution discrepancies at 
the sacrifice of some power in the testing of distribution "location" (compared, in this case, to 
the pure "location" tests represented by Student's t test and the Wilcoxon-Mann-Whitney test). 
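
The insensitivity to interval order described in this paragraph can be checked directly. The sketch below uses a generic stand-in, the sum of the logarithms of the interval sizes, rather than the manuscript's exact fidelity formula; `log_sum_of_intervals` and the two example partitions are hypothetical. Any statistic of this form returns the same value whether the short intervals lie next to each other or are spread across the cumulative interval.

```python
import math

def log_sum_of_intervals(sizes):
    """Sum of the logarithms of the interval sizes.

    The sizes must partition the cumulative interval [0, 1].
    math.fsum is used so that the result does not depend on summation order.
    """
    if abs(math.fsum(sizes) - 1.0) > 1e-9:
        raise ValueError("interval sizes must sum to 1")
    return math.fsum(math.log(s) for s in sizes)

# Two partitions of [0, 1] built from the same six interval sizes.
clustered = [0.05, 0.05, 0.05, 0.25, 0.30, 0.30]  # short intervals adjacent
dispersed = [0.05, 0.25, 0.05, 0.30, 0.05, 0.30]  # short intervals spread out

value_clustered = log_sum_of_intervals(clustered)
value_dispersed = log_sum_of_intervals(dispersed)
print(value_clustered, value_dispersed)
```

Distinguishing the two arrangements would require a statistic that also looks at where the short intervals fall, which is what the "location" or "clustering" tests mentioned above do.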

That the fidelity represents the zeroth order step in an analysis certainly does not preclude 
examination of the data at a higher order. It is nevertheless remarkable how much can already 
be obtained upon this zeroth order consideration (optimal parameter estimation and generally 
optimal concordance assessment, both in a coordinate-independent manner). The different orders 
of data analysis are represented well in Peirce's examination of the randomness of the digits of 
π (see also §9): 

"In order to illustrate this mode of induction, I have made a few observations on the 
calculated number. There ought to be, in 350 successive figures, about 35 fives. The 
odds are about 2 to 1 that there will be 30-39 [and] 3 to 1 that there will be 29-41. 
Now I find in the first 350 figures 33 fives, and in the second 350, 28 fives, which is 
not particularly unlikely under the supposition of a chance distribution. During the 
process of counting these 5's, it occurred to me that as the expression of a rational 
fraction in decimals takes the form of a circulating decimal in which the figures recur 
with perfect regularity, so in the expression of a quantity like π, it was naturally 
to be expected that the 5's, or any other figure, should recur with some approach 
to regularity. In order to find out whether anything of this kind was discernible I 
counted the fives in 70 successive sets of 10 successive figures each. Now were there 
no regularity at all in the recurrence of the 5's, there ought among these 70 sets of 
ten numbers each to be 27 that contained just one five each; and the odds against 
there being more than 32 of the seventy sets that contain just one five each is about 
5 to 1. Now it turns out upon examination that there are 33 of the sets of ten figures 
which contain just one 5. It thus seems as if my surmise were right that the figures 
will be a little more regularly distributed than they would be if they were entirely 
independent of one another. But there is not much certainty about it. This will serve 
to illustrate what this kind of induction is like, in which the question to be decided 
is how far a given succession of occurrences are independent of one another and if 
they are not independent what the nature of the law of their succession is." (Peirce 
7.121) 

When presented with the digits of π, the first natural question one should ask is "Does each 
digit appear roughly the same number of times as every other digit?" This zeroth order question 
is a question concerning local information, in the same sense as the fidelity. No correlations with 
other digits are considered. One simply counts the occurrence of each digit in the first N digits 
and compares with the statistical expectation (for the binary expression of π, this would involve 
a comparison with a Bernoulli process having q = 0.5, as discussed in §9). In Peirce's example, 
he finds the first 350 digits of π contain 33 fives, in accord with statistical expectations for a 
Bernoulli process with q = 0.1. However, he then searches for higher order discrepancies, which 
may still be present. He considers the number of 5's in each set of 10 successive digits of π in 
order to see if there is any regularity on this 10 digit "length scale". He finds marginal evidence 
for a discrepancy from a pure Bernoulli process, but the evidence is not statistically significant. 
Of course, the choice of every 10 digits was arbitrary. One might also look for correlations of 
digits that are displaced from one another (e.g. examine the distribution of 5's in every other 
digit or by skipping every 2 digits, etc.). Due to the many possible correlations that can be 
defined, and might actually be present in any data set, these higher order investigations require 
a degree of caution that is unnecessary at the zeroth order (at least for the analysis of the digits 
of π or, as throughout this manuscript, for the testing of model concordance with univariate data 
through use of the fidelity). Note that "location" or "clustering" tests essentially skip this zeroth 
order step and proceed directly to a higher order of analysis, with the higher order test often 
based on assumptions that may already be clearly violated at the zeroth order. For example, 
one might incorrectly apply Student's t test to two data sets that are not drawn from Gaussians; 
by contrast, the fidelity-based generalization of Student's t test shown in Figs. 19 and 20 allows 
examination of the concordance of each model distribution with the data set it is supposed to 
represent (p_1 and p_2) as well as the absolute concordance of the joint model fit (p). 
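
The zeroth order expectations in Peirce's example follow from the binomial distribution and can be reproduced in a few lines. This is a minimal sketch that assumes independent digits, each equal to 5 with probability 0.1; the function names `binom_pmf` and `binom_range_prob` are hypothetical.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def binom_range_prob(lo, hi, n, p):
    """Probability that the number of successes lies in [lo, hi] inclusive."""
    return sum(binom_pmf(k, n, p) for k in range(lo, hi + 1))

n_digits, p_five = 350, 0.1

# Expected number of fives in 350 digits (Peirce: "about 35").
expected_fives = n_digits * p_five

# Probabilities of the two ranges Peirce quotes, expressed as odds in favor.
p_30_39 = binom_range_prob(30, 39, n_digits, p_five)
p_29_41 = binom_range_prob(29, 41, n_digits, p_five)
odds_30_39 = p_30_39 / (1 - p_30_39)
odds_29_41 = p_29_41 / (1 - p_29_41)

# Expected number of sets, out of 70 sets of 10 digits, holding exactly one five.
p_one_five = binom_pmf(1, 10, p_five)
expected_sets = 70 * p_one_five

print(expected_fives, odds_30_39, odds_29_41, expected_sets)
```

The expected number of sets with exactly one five comes out at about 27.1, matching the 27 quoted by Peirce. The printed odds can be compared with his "about 2 to 1" and "3 to 1".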

The extension of maximum fidelity to higher dimensional data is possible in a coordinate- 
independent fashion but such an extension is not unique due to the lack of a unique way to 
define the cumulative distribution in higher dimensions (see §10). Whether the extension to 
higher dimensional data is straightforward or not depends on the symmetry properties of the 
coordinate system along with the class of models under consideration. This is not a specific 
weakness of maximum fidelity as compared to other approaches (as few techniques in general 
exist for higher dimensional data), but rather a general recognition of the significantly more 
challenging problems presented by higher dimensional spaces. 

If scientific inference were based only on finding the most concordant model to the data, then 
we should simply seek out models with as many parameters as necessary to fit each data set, which 
would amount to an intellectually fallow descriptive empiricism. But science has successfully 
been shown (often enough anyway!) to be based on simple theories that can summarize large 
amounts of data. To be efficient, scientific inference must be based on Ockham's razor, as Peirce 
also recognized, but in a confused manner that I will address below: 

"Parsimony (law of): Ockham's razor, i.e. the maxim 'Entia non sunt multiplicanda 
praeter necessitatem.' The meaning is, that it is bad scientific method to introduce, 
at once, independent hypotheses to explain the same facts of observation. 
Though the maxim was first put forward by nominalists, its validity must be admitted 
on all hands, with one limitation; namely, it may happen that there are two theories 
which, so far as can be seen, without further investigation, seem to account for a 
certain order of facts. One of these theories has the merit of superior simplicity. The 
other, though less simple, is on the whole more likely. But this second one cannot 
be thoroughly tested by a deeper penetration into the facts without doing almost all 
the work that would be required to test the former. In that case, although it is good 
scientific method to adopt the simpler hypothesis to guide systematic observations, 
yet it may be better judgment, in advance of more thorough knowledge, to suppose 
the more complex hypothesis to be true. For example, I know that men's motives are 
generally mixed. If, then, I see a man pursuing a line of conduct which apparently 
might be explained as thoroughly selfish, and yet might be explained as partly selfish 
and partly benevolent, then, since absolutely selfish characters are somewhat rare, 
it will be safer for me in my dealings with the man to assume the more complex 
hypothesis to be true; although were I to undertake an elaborate examination of the 
question, I ought to begin by ascertaining whether the hypothesis of pure selfishness 
would quite account for all he does." (Peirce 7.92-93) 

There is actually no evidence that Ockham ever wrote "Entia non sunt multiplicanda praeter 
necessitatem" ("Entities should not be multiplied without necessity"), but what he did write in 
a similar vein was "Numquam ponenda est pluralitas sine necessitate" ("Plurality must never 
be posited without necessity") and "Frustra fit per plura quod potest fieri per pauciora" ("It is 
futile to do with more things that which can be done with fewer"). Ockham, along with many 
other medieval scholars, was interested in categorizing all of existence into the simplest possible 
taxonomic tree containing the fewest branches. Insertion of additional categories, when fewer 
would suffice, was considered a roadblock for understanding the fundamental organization of 
the entire universe. Elsewhere, Peirce arrives closer to the true spirit of Ockham's razor: 

"Science ought to try the simplest hypothesis first, with little regard to its probability 
or improbability, although regard ought to be paid to its consonance with other 
hypotheses, already accepted." (Peirce 4.1) 

"Consonance with other hypotheses, already accepted" is absolutely key to the very definition 
of simplicity, a point which Peirce failed to fully appreciate in his own example above. Simplicity 
should never be examined in a vacuum, but always in the context of one's theory of the entire 
universe, in line with Peirce's central maxim of his philosophy of Pragmatism: 

"Consider what effects, that might conceivably have practical bearings, we conceive 
the object of our conception to have. Then, our conception of these effects is the 
whole of our conception of the object." (Peirce 5.2) 

Any theory we posit for a given data set must not unnecessarily overcomplicate our model for 
the entire universe. For Peirce's example above, assume for simplicity that "purely selfish" acts 
are not only rare but not known to exist, and also assume that we have no other knowledge 
about the act under consideration. Therefore, to assume the man is acting with "purely selfish" 
intentions would not be the simplest assumption, as it would create a new category not present 
in the known universe. In this case, Peirce errs in considering the "partly selfish and partly 
benevolent" hypothesis as the more complex one. In fact, it is the simpler hypothesis (it leads 
to the simplest extension of our universe model), perfectly in line with Ockham's razor. The 
scientific method is based on positing the simplest theories that are sufficiently concordant with 
previous knowledge for further empirical testing. Maximum fidelity allows a universal notion 
of concordance that should assist in evaluating a range of models with different degrees of 
complexity for further testing (Fig. 32). The simplest models pose questions that allow for the 
fastest gain in knowledge, in accord with Peirce's revealing analogy of scientific inference to the 
game of "twenty questions": 

"The qualities which these considerations induce us to value in a hypothesis are three, 
which I may entitle Caution, Breadth, and Incomplexity. In respect to caution, the 
game of twenty questions is instructive. In this game, one party thinks of some indi- 
vidual object, real or fictitious, which is well-known to all educated people. The other 
party is entitled to answers to any twenty interrogatories they propound which can 
be answered by Yes or No, and are then to guess what was thought of, if they can. 
If the questioning is skillful, the object will invariably be guessed; but if the ques- 
tioners allow themselves to be led astray by the will-o-the-wisp of any prepossession, 
they will almost as infallibly come to grief. The uniform success of good question- 
ers is based upon the circumstance that the entire collection of individual objects 
well-known to all the world does not amount to a million. If, therefore, each question 
could exactly bisect the possibilities, so that yes and no were equally probable, the 
right object would be identified among a collection numbering 2^20 ... or over one 
million and forty-seven thousand, or more than the entire number of objects from 
which the selection has been made. Thus, twenty skillful hypotheses will ascertain 
what two hundred thousand stupid ones might fail to do. The secret of the business 
lies in the caution which breaks a hypothesis up into its smallest logical components, 
and only risks one of them at a time. What a world of futile controversy and of 
confused experimentation might have been saved if this principle had guided inves- 
tigations into the theory of light! The ancient and medieval notion was that sight 
starts from the eye, is shot to the object from which it is reflected, and returned to 
the eye. This idea had, no doubt, been entirely given up before Romer showed that 
it took light a quarter of an hour to traverse the earth's orbit, a discovery which 
would have refuted it by the experiment of opening the closed eyes and looking at 
the stars. The next point in order was to ascertain of what the ray of light consisted. 
But this not being answerable by yes or no, the first question should have been 'Is 
the ray homogeneous along its length?' Diffraction showed that it was not so. That 
being established, the next question should have been 'Is the ray homogeneous on 
all sides?' Had that question been put to experiment, polarization must have been 
speedily discovered; and the same sort of procedure would have developed the whole 
theory with a gain of half a century." (Peirce 7.220) 
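
The arithmetic behind Peirce's "twenty questions" argument can be made explicit. The sketch below assumes ideal questions that each split the remaining candidates as evenly as possible; `questions_needed` is a hypothetical helper name.

```python
def questions_needed(n_objects):
    """Worst-case number of ideal yes/no questions needed to single out
    one object among `n_objects`, when each question bisects the candidates."""
    if n_objects < 1:
        raise ValueError("need at least one object")
    questions = 0
    remaining = n_objects
    while remaining > 1:
        # After a bisecting question, at most the larger half remains.
        remaining = (remaining + 1) // 2
        questions += 1
    return questions

print(2**20)                        # 1048576 distinguishable objects
print(questions_needed(1_000_000))  # 20
print(questions_needed(2**20 + 1))  # 21
```

Twenty well-chosen questions therefore suffice for any collection of up to 2^20 = 1,048,576 objects, which is the figure in the quotation. One more object would require a twenty-first question.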

The absolute simplicity or generality of a given scientific hypothesis or question is of course 
impossible to quantify. Even if it could be quantified, many hypotheses might still have a similar 
simplicity (even in an exact mathematical sense). One should, however, not make the mistake 
of relying on mathematical and/or statistical reasoning alone to define what is simple (e.g. 
by simply counting model parameters). The notion of simplicity is often highly controversial 
(indeed, many scientific debates are ultimately based on the question of which hypothesis or 
explanation is the simplest or the most worthwhile for further testing), but the process elegantly 
described above by Peirce nevertheless serves as an essential guiding principle for truly efficient 
scientific inference. 

There are a few obvious open problems immediately suggested from the results of this 
manuscript. A mathematical proof that can account for the exact overlap of the parameter 
estimate distribution derived from maximum fidelity and that derived from the standard 
deviation for the Gaussian width σ (Fig. 11) would be of great interest. Also of interest would be 
a rigorous demonstration of the asymptotic equivalence of maximum fidelity and minimum χ^2 
for binned data (see §8). Universal conventions would be worthwhile to define for the analysis of 
2D distributions on the surface of the sphere or on the torus, for higher dimensional Gaussians 
(ordering of the Euler angle axes), and for other general higher dimensional distributions. A 
rather deeper line of enquiry would be whether there exists a generic foundation for taking 
into account the "location" or "clustering" of shorter-than-average vs. longer-than-average 
cumulative intervals along the entire cumulative distribution. Discovery of a general approach (or 
approaches) for analyzing successively higher orders of information would of course be 
worthwhile and would help to ground the current ad hoc nature of popular "location" or "clustering" 
tests. The best way to mathematically integrate the asymptotic notion of "degrees of freedom" 
to "correct" the absolute p value (obtained from maximization of the fidelity over some or all 
of the model parameters) based on the work of Cheng & Stephens for goodness-of-fit using the 
spacings statistic might also be of interest to many researchers; such "corrections" should 
nevertheless be interpreted very carefully (as should the absolute p value for that matter) for 
the reasons given in §. 

Figure 32: Efficient scientific inference requires a balance of model concordance with model simplicity. Each point 
represents a particular theory's concordance with the data versus its complexity. The dashed line indicates perfect 
concordance of the model with the data (e.g. perfect fidelity with p = 1). The different colors represent different 
classes of theories, with the progression from left to right indicating the inclusion of additional parameters to bring 
the model into concordance with the data. Of course, complex models can also be highly discrepant with the data 
(red point). The grey zone indicates the region of optimal scientific inference, corresponding to models with sufficient 
concordance that are not too complex (if the model is too complex, it would provide little insight). Models that are 
too simple are not worth considering as they will generally be highly discrepant with the data (note the absence of 
points in the upper left corner). A "good model" (like the black or green dots in the grey zone) may not be perfectly 
concordant with the data due to the remaining statistical uncertainty. 

Maximum fidelity can be applied in a unique and universal fashion for the testing of arbitrary 
models against the complete local information present in a univariate data set (or across 
multiple univariate data sets). Upon a generic choice of convention, it can also be used to efficiently 
extract information from multivariate data, including a coordinate-independent assessment of 
model concordance not possible with any other technique. The results in this manuscript prove 
that statistical inference can indeed be based completely on the fundamental notion of model 
concordance, with maximum fidelity representing a "proper form of probable inference" through 
its complete avoidance of the probability and parameter fallacies required for Bayesian and most 
frequentist approaches to statistics. Most intriguingly, the results obtained herein suggest that 
the optimal and most efficient basis for statistical inference is through the assessment of the 
information contained in the cumulative distribution, establishing a specific fundamental connection 
between optimal statistical inference and information theory that had either not been so 
envisioned (e.g. the work of Jaynes), or even if envisioned, as for the spacings statistic [15, 38], 
never convincingly established. 